Abstract. We study the time-bounded reachability problem for continuous-time
Markov decision processes (CTMDPs) and games (CTMGs). Existing techniques
for this problem use discretisation techniques to break time into discrete
intervals, and optimal control is approximated for each interval separately. Current
techniques provide an accuracy of $O(\varepsilon^2)$ on each interval, which leads to an
infeasibly large number of intervals. We propose a sequence of approximations
that achieve accuracies of $O(\varepsilon^3)$, $O(\varepsilon^4)$, and $O(\varepsilon^5)$, that allow us to drastically
reduce the number of intervals that are considered. For CTMDPs, the performance
of the resulting algorithms is comparable to the heuristic approach given
by Buckholz and Schulz [6], while also being theoretically justified. All of our
results generalise to CTMGs, where our results yield the first practically
implementable algorithms for this problem. We also provide positional strategies for
both players that achieve similar error bounds.



1 Introduction 

Probabilistic models are being used extensively in the formal analysis of complex sys- 
tems, including networked, distributed, and most recently, biological systems. Over the 
past 15 years, probabilistic model checking for discrete-time Markov decision pro- 
cesses (MDPs) and continuous-time Markov chains (CTMCs) has been successfully 
applied to these rich academic and industrial applications [9 8 1 333). However, the the- 
ory for continuous-time Markov decision processes (CTMDPs), which mix the non- 
determinism of MDPs with the continuous-time setting of CTMCs, is less well devel- 
oped. 

This paper studies the time-bounded reachability problem for CTMDPs and their
extension to continuous-time Markov games, which is a model with both helpful and
hostile non-determinism. This problem is of paramount importance for model checking
applications [5]. The non-determinism in the system is resolved by providing a
scheduler. The time-bounded reachability problem is to determine or to approximate, for a
given set of goal locations $G$ and time bound $T$, the maximal (or minimal) probability
of reaching $G$ before the deadline $T$ that can be achieved by a scheduler.

Early work on this problem focused on restricted classes of schedulers, such as
schedulers without any access to time in systems with uniform transition rates [1]. Recently
however, results have been proved for the more general class of late schedulers [15],
which will be studied in this paper. The different classes of schedulers are contrasted by
Neuhäußer et al., and they show that late schedulers are the most powerful class.
Several algorithms have been given to approximate the time-bounded reachability
probabilities for CTMDPs using this scheduler class [5,7,15,18].

The current state-of-the-art techniques for solving this problem are based on
different forms of discretisation. This technique splits the time bound $T$ into small intervals
of length $\varepsilon$. Optimal control is approximated for each interval separately, and these
approximations are combined to produce the final result. Current techniques can
approximate optimal control on an interval of length $\varepsilon$ with an accuracy of $O(\varepsilon^2)$. However, to
achieve a precision of $\pi$ with these techniques, one must choose $\varepsilon \approx \pi/T$, which leads
to $O(T^2/\pi)$ many intervals. Since the desired precision is often high (it is common to
require that $\pi \leq 10^{-6}$), this leads to an infeasibly large number of intervals that must
be considered by the algorithms.

A recent paper of Buckholz and Schulz [6] has addressed this problem for
practical applications, by allowing the interval sizes to vary. In addition to computing an
approximation of the maximal time-bounded reachability probability, which provides a
lower bound on the optimum, they also compute an upper bound. As long as the
upper and lower bounds do not diverge too far, the interval can be extended indefinitely.
In practical applications, where the optimal choice of action changes infrequently, this
idea allows their algorithm to consider far fewer intervals while still maintaining high
precision. However, from a theoretical perspective, their algorithm is not particularly
satisfying. Their method for extending interval lengths depends on a heuristic, and in
the worst case their algorithm may consider $O(T^2/\pi)$ intervals, which is not better than
other discretisation-based techniques.



Our contribution. In this paper we present a method of obtaining larger interval sizes
that satisfies both theoretical and practical concerns. Our approach is to provide more
precise approximations for each interval of length $\varepsilon$. While current techniques provide an
accuracy of $O(\varepsilon^2)$, we propose a sequence of approximations, called double $\varepsilon$-nets,
triple $\varepsilon$-nets, and quadruple $\varepsilon$-nets, with accuracies $O(\varepsilon^3)$, $O(\varepsilon^4)$, and $O(\varepsilon^5)$,
respectively. Since these approximations are much more precise on each interval, they allow
us to consider far fewer intervals while still maintaining high precision. For example,
Table 1 gives the number of intervals considered by our algorithms, in the worst case,
for a normed CTMDP with time bound $T = 10$.



Technique                  Error                  $\pi = 10^{-7}$      $\pi = 10^{-9}$        $\pi = 10^{-11}$
Current techniques         $O(\varepsilon^2)$     1,000,000,000        100,000,000,000        10,000,000,000,000
Double $\varepsilon$-nets       $O(\varepsilon^3)$     81,650               816,497                8,164,966
Triple $\varepsilon$-nets       $O(\varepsilon^4)$     3,219                14,939                 69,337
Quadruple $\varepsilon$-nets    $O(\varepsilon^5)$     605                  1,911                  6,043

Table 1. The number of intervals needed by our algorithms for precisions $10^{-7}$, $10^{-9}$,
and $10^{-11}$.



Of course, in order to become more precise, we must spend additional
computational effort. However, the cost of using double $\varepsilon$-nets instead of using current
techniques requires only an extra factor of $\log |\Sigma|$, where $\Sigma$ is the set of actions. Thus, in
almost all cases, the large reduction in the number of intervals far outweighs the
extra cost of using double $\varepsilon$-nets. Our worst case running times for triple and quadruple
$\varepsilon$-nets are not so attractive: triple $\varepsilon$-nets require an extra $|L| \cdot |\Sigma|^2$ factor over double
$\varepsilon$-nets, where $L$ is the set of locations, and quadruple $\varepsilon$-nets require yet another $|L| \cdot |\Sigma|^2$
factor over triple $\varepsilon$-nets. However, these worst case running times only occur when the
choice of optimal action changes frequently, and we speculate that the cost of using
these algorithms in practice is much lower than our theoretical worst case bounds. Our
experimental results with triple $\varepsilon$-nets support this claim.

An added advantage of our techniques is that they can be applied to continuous- 
time Markov games as well as to CTMDPs. Buckholz and Schulz restrict their analysis 
to CTMDPs. Therefore, to the best of our knowledge, we present the first practically 
implementable approximation algorithms for the time-bounded reachability problem in 
CTMGs. Each approximation also provides positional strategies for both players that 
achieve similar error bounds. 

2 Preliminaries 

Definition 1. A continuous-time Markov game (or simply Markov game) is a tuple
$(L, L_r, L_s, \Sigma, R, P, \nu)$, consisting of a finite set $L$ of locations, which is partitioned
into locations $L_r$ (controlled by a reachability player) and $L_s$ (controlled by a safety
player), a finite set $\Sigma$ of actions, a rate matrix $R : (L \times \Sigma \times L) \to \mathbb{Q}_{\geq 0}$, a discrete
transition matrix $P : (L \times \Sigma \times L) \to \mathbb{Q} \cap [0, 1]$, and an initial distribution $\nu \in Dist(L)$.

We require that the following side-conditions hold: For all locations $l \in L$, there must
be an action $a \in \Sigma$ such that $R(l, a, L) := \sum_{l' \in L} R(l, a, l') > 0$, which we call
enabled. We denote the set of enabled actions in $l$ by $\Sigma(l)$. For a location $l$ and actions
$a \in \Sigma(l)$, we require for all locations $l'$ that $P(l, a, l') = \frac{R(l, a, l')}{R(l, a, L)}$, and we require
$P(l, a, l') = 0$ for non-enabled actions. We define the size $|\mathcal{M}|$ of a Markov game as
the number of non-zero rates in the rate matrix $R$.

A Markov game is called uniform with uniform transition rate $\lambda$, if $R(l, a, L) = \lambda$
holds for all locations $l$ and enabled actions $a \in \Sigma(l)$. We further call a Markov game
normed, if its uniformisation rate is 1. Note that for normed Markov games we have
$R = P$. We will present our results for normed Markov games only. The following
lemma states that our algorithms for normed Markov games can be applied to solve
general Markov games.

Lemma 1. We can adapt an $O(f(\mathcal{M}))$ time algorithm for normed Markov games to
solve general Markov games in time $O(f(\mathcal{M}) + |L|)$.

We are particularly interested in Markov games with a single player, which are
continuous-time Markov decision processes (CTMDPs). In CTMDPs all positions
belong to the reachability player ($L = L_r$), or to the safety player ($L = L_s$), depending
on whether we analyse the maximum or minimum reachability probability problem.



Fig. 1. Left: a normed Markov game. Right: the function $f$ within $[0, 4]$ for $l_R$ and $l_S$.



As a running example, we will use the normed Markov game shown in the left
half of Figure 1. Locations belonging to the safety player are drawn as circles, and
locations belonging to the reachability player are drawn as rectangles. The self-loops of
the normed Markov game are omitted. The locations $G$ and $\bot$ are absorbing, and there
is only a single enabled action for $l$. It therefore does not matter which player owns $l$,
$G$, and $\bot$.

Schedulers and Strategies We consider Markov games in a time interval $[0, T]$ with
$T \in \mathbb{R}_{\geq 0}$. The non-determinism in the system needs to be resolved by a pair of
strategies for the two players which together form a scheduler for the whole system.
Formally, a strategy is a function in $Paths_{r/s} \times [0, T] \to \Sigma$, where $Paths_r$ and $Paths_s$
are the sets of finite paths $l_0 \xrightarrow{a_0, t_0} l_1 \cdots \xrightarrow{a_{n-1}, t_{n-1}} l_n$ with $l_n \in L_r$ and $l_n \in L_s$,
respectively, and we use $\mathcal{S}_r$ and $\mathcal{S}_s$ to denote the strategies of the reachability player
and the strategies of the safety player, respectively. (For technical reasons one has to
restrict the schedulers to those which are measurable. This restriction, however, is of no
practical relevance. In particular, simple piecewise constant timed-positional strategies
$L \times [0, T] \to \Sigma$ suffice for optimal scheduling [17,15,2], and all schedulers that occur
in this paper are from the particularly tame class of cylindrical schedulers [17].)

If we fix a pair $(\mathcal{S}_r, \mathcal{S}_s)$ of strategies, we obtain a deterministic stochastic process,
which is in fact a time inhomogeneous Markov chain, and we denote it by $\mathcal{M}_{\mathcal{S}_{r+s}}$. For
$t \leq T$, we use $Pr_{\mathcal{S}_{r+s}}(t)$ to denote the transient distribution at time $t$ over S under the
scheduler $(\mathcal{S}_r, \mathcal{S}_s)$.

Given a Markov game $\mathcal{M}$, a goal region $G \subseteq L$, and a time bound $T \in \mathbb{R}_{\geq 0}$,
we are interested in the optimal probability of being in a goal state at time $T$ (and the
corresponding pair of optimal strategies). This is given by:

$$\sup_{\mathcal{S}_r} \inf_{\mathcal{S}_s} \sum_{l \in G} Pr_{\mathcal{S}_{r+s}}(l, T),$$

where $Pr_{\mathcal{S}_{r+s}}(l, T) := Pr_{\mathcal{S}_{r+s}}(T)(l)$. It is commonly referred to as the maximum
time-bounded reachability probability problem in the case of CTMDPs with a reachability
player only. For $t \leq T$, we define $f : L \times \mathbb{R}_{\geq 0} \to [0, 1]$ to be the optimal probability to
be in the goal region at the time bound $T$, assuming that we start in location $l$ and that
$t$ time units have passed already. By definition, it holds then that $f(l, T) = 1$ if $l \in G$
and $f(l, T) = 0$ if $l \notin G$. Optimising the vector of values $f(\cdot, 0)$ then yields the optimal
value and its optimal piecewise deterministic strategy.

Let us return to the example shown in Figure 1. The right half of the figure shows
the optimal reachability probabilities, as given by $f$, for the locations $l_R$ and $l_S$ when
the time bound $T = 4$. The points $t_1 \approx 1.123$ and $t_2 \approx 0.609$ represent the times
at which the optimal strategies change their decisions. Before $t_1$ it is optimal for the
reachability player to use action $b$ at $l_R$, but afterwards the optimal choice is action $a$.
Similarly, the safety player uses action $b$ before $t_2$, and switches to $a$ afterwards.

Characterisation of $f$ We define a matrix $Q$ such that $Q(l, a, l') = R(l, a, l')$ if $l' \neq l$
and $Q(l, a, l) = -\sum_{l' \neq l} R(l, a, l')$. The optimal function $f$ can be characterised as the
following set of differential equations [2], see also [13,12]. For each $l \in L$ we define
$f(l, T) = 1$ if $l \in G$, and $0$ if $l \notin G$. Otherwise, for $t < T$, we define:

$$-\dot{f}(l, t) = \operatorname*{opt}_{a \in \Sigma(l)} \sum_{l' \in L} Q(l, a, l') \cdot f(l', t), \tag{1}$$

where $\operatorname{opt} \in \{\max, \min\}$ is max for reachability player locations and min for safety
player locations. We will use the opt-notation throughout this paper.
Using the matrix $R$, Equation (1) can be rewritten to:

$$-\dot{f}(l, t) = \operatorname*{opt}_{a \in \Sigma(l)} \sum_{l' \in L} R(l, a, l') \cdot \big(f(l', t) - f(l, t)\big) \tag{2}$$

For uniform Markov games, we simply have $Q(l, a, l) = R(l, a, l) - \lambda$, with $\lambda = 1$
for normed Markov games. This also provides an intuition for the fact that
uniformisation does not alter the reachability probability: the rate $R(l, a, l)$ does not appear in (1).
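The equivalence of the two formulations can be checked numerically for a single location and action. The following is a minimal sketch: the rates and values are made-up illustration data (not taken from the running example), and the helper names `q_row`, `rhs_eq1` and `rhs_eq2` are hypothetical.

```python
def q_row(R_la, l):
    """Row Q(l, a, .) built from the rates R(l, a, .) of one action."""
    q = {l2: r for l2, r in R_la.items() if l2 != l}
    q[l] = -sum(q.values())  # Q(l,a,l) = -sum over l' != l of R(l,a,l')
    return q


def rhs_eq1(R_la, l, f):
    """sum_{l'} Q(l,a,l') * f(l'): the sum inside Equation (1)."""
    return sum(q * f[l2] for l2, q in q_row(R_la, l).items())


def rhs_eq2(R_la, l, f):
    """sum_{l'} R(l,a,l') * (f(l') - f(l)): the sum inside Equation (2)."""
    return sum(r * (f[l2] - f[l]) for l2, r in R_la.items())


# Hypothetical normed action at location "s": rates sum to 1, self-loop 0.7.
R_sa = {"s": 0.7, "u": 0.2, "G": 0.1}
f_vals = {"s": 0.3, "u": 0.5, "G": 1.0}

v1 = rhs_eq1(R_sa, "s", f_vals)
v2 = rhs_eq2(R_sa, "s", f_vals)

# Changing only the self-loop rate leaves both sums unchanged.
R_no_loop = dict(R_sa, s=0.0)
v3 = rhs_eq1(R_no_loop, "s", f_vals)
```

Both sums evaluate to the same number, and removing the self-loop does not change it, which mirrors the remark that the self-loop rate plays no role in the characterisation.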

3 Approximating Optimal Control for Normed Markov Games 

In this section we describe $\varepsilon$-nets, which are a technique for approximating optimal
values and strategies in a normed continuous-time Markov game. Thus, throughout the
whole section, we fix a normed Markov game $\mathcal{M} = (L, L_r, L_s, \Sigma, R, P, \nu)$.

Our approach to approximating optimal control within the Markov game is to break
time into intervals of length $\varepsilon$, and to approximate optimal control separately in each
of the $\lceil T/\varepsilon \rceil$ distinct intervals. Optimal time-bounded reachability probabilities are then
computed iteratively for each interval, starting with the final interval and working
backwards in time. The error made by the approximation in each interval is called the step
error. In Section 3.1 we show that if the step error in each interval is bounded, then the
global error made by our approximations is also bounded.

Our results begin with a simple approximation that finds the optimal action at the
start of each interval, and assumes that this action is optimal for the duration of the
interval. We refer to this as the single $\varepsilon$-net technique, and we will discuss this
approximation in Section 3.2. While it only gives a simple linear function as an approximation,
this technique gives error bounds of $O(\varepsilon^2)$, which is comparable to existing techniques.

However, single $\varepsilon$-nets are only a starting point for our results. Our main observation
is that, if we have a piecewise polynomial approximation of degree $c$ that achieves an
error bound of $O(\varepsilon^k)$, then we can compute a piecewise polynomial approximation of
degree $c + 1$ that achieves an error bound of $O(\varepsilon^{k+1})$. Thus, starting with single
$\varepsilon$-nets, we can construct double $\varepsilon$-nets, triple $\varepsilon$-nets, and quadruple $\varepsilon$-nets, with each of
these approximations becoming increasingly more precise. The construction of these
approximations will be discussed in Sections 3.3 and 3.4.

In addition to providing an approximation of the time-bounded reachability
probabilities, our techniques also provide positional strategies for both players. For each level
of $\varepsilon$-net, we will define two approximations: the function $p_1$ is the approximation for
the time-bounded reachability probability given by single $\varepsilon$-nets, and the function $g_1$
gives the reachability probability obtained by following the positional strategy that is
derived from $p_1$. This notation generalises to deeper levels of $\varepsilon$-nets: the functions $p_2$
and $g_2$ are produced by double $\varepsilon$-nets, and so on.

We will use $\mathcal{E}(k, \varepsilon)$ to denote the difference between $p_k$ and $f$. In other words,
$\mathcal{E}(k, \varepsilon)$ gives the difference between the approximation $p_k$ and the true optimal
reachability probabilities. We will use $\mathcal{E}_s(k, \varepsilon)$ to denote the difference between $g_k$ and $f$.
We defer the formal definition of these measures to subsequent sections. Our objective in
the following subsections is to show that the step errors $\mathcal{E}(k, \varepsilon)$ and $\mathcal{E}_s(k, \varepsilon)$ are in
$O(\varepsilon^{k+1})$, with small constants.

3.1 Step Error and Global Error 

In subsequent sections we will prove bounds on the $\varepsilon$-step error that is made by our
approximations. This is the error that is made by our approximations in a single interval
of length $\varepsilon$. However, in order for our approximations to be valid, they must provide a
bound on the global error, which is the error made by our approximations over every
$\varepsilon$ interval. In this section, we prove that, if the $\varepsilon$-step error of an approximation is
bounded, then the global error of the approximation is bounded by the sum of these
errors.

We define $f : [0, T] \to [0, 1]^{|L|}$ as the vector valued function $f(t) \mapsto \bigotimes_{l \in L} f(l, t)$
that maps each point of time to a vector of reachability probabilities, with one entry
for each location. Given two such vectors $f(t)$ and $p(t)$, we define the maximum norm
$\|f(t) - p(t)\| = \max\{|f(l, t) - p(l, t)| \mid l \in L\}$, which gives the largest difference
between $f(l, t)$ and $p(l, t)$.

We also introduce notation that will allow us to define the values at the start of an
$\varepsilon$ interval. For each interval $[t - \varepsilon, t]$, we define $f^t_x : [t - \varepsilon, t] \to [0, 1]^{|L|}$ to be the
function obtained from the differential equations (1) when the values at the time $t$ are
given by the vector $x \in [0, 1]^{|L|}$. More formally, if $\tau = t$ then we define $f^t_x(\tau) = x$,
and if $t - \varepsilon \leq \tau < t$ and $l \in L$ then we define:

$$-\dot{f}^t_x(l, \tau) = \operatorname*{opt}_{a \in \Sigma(l)} \sum_{l' \in L} Q(l, a, l') \cdot f^t_x(l', \tau). \tag{3}$$

The following lemma states that if the $\varepsilon$-step error is bounded for every interval,
then the global error is simply the sum of these errors.

Lemma 2. Let $p$ be an approximation of $f$ that satisfies $\|f(t) - p(t)\| \leq \mu$ for some
time point $t \in [0, T]$. If $\|f^t_{p(t)}(t - \varepsilon) - p(t - \varepsilon)\| \leq \nu$ then we have
$\|f(t - \varepsilon) - p(t - \varepsilon)\| \leq \mu + \nu$.



3.2 Single $\varepsilon$-Nets

In single $\varepsilon$-nets, we compute the gradient of the function $f$ at the end of each interval,
and we assume that this gradient remains constant throughout the interval. This yields
a linear approximation function $p_1$, which achieves a local error of $\varepsilon^2$.

We now define the function $p_1$. For initialisation, we define $p_1(l, T) = 1$ if $l \in G$
and $p_1(l, T) = 0$ otherwise. Then, if $p_1$ is defined for the interval $[t, T]$, we will use
the following procedure to extend it to the interval $[t - \varepsilon, T]$. We first determine the
optimising enabled actions for each location for $f^t_{p_1(t)}$ at time $t$. That is, we choose, for
all $l \in L$, an action:

$$a^t_l \in \arg \operatorname*{opt}_{a \in \Sigma(l)} \sum_{l' \in L} Q(l, a, l') \cdot p_1(l', t). \tag{4}$$

We then fix $c^t_l = \sum_{l' \in L} Q(l, a^t_l, l') \cdot p_1(l', t)$ as the descent of $p_1(l, \cdot)$ in the interval
$[t - \varepsilon, t)$. Therefore, for every $\tau \in [0, \varepsilon]$ and every $l \in L$ we have:

$$-\dot{p}_1(l, t - \tau) = c^t_l \quad \text{and} \quad p_1(l, t - \tau) = p_1(l, t) + \tau \cdot c^t_l.$$

Let us return to our running example. We will apply the approximation $p_1$ to the
example shown in Figure 1. We will set $\varepsilon = 0.1$, and focus on the interval $[1.1, 1.2]$ with
initial values $p_1(G, 1.2) = 1$, $p_1(l, 1.2) = 0.244$, $p_1(l_R, 1.2) = 0.107$, $p_1(l_S, 1.2) = 0.075$,
$p_1(\bot, 1.2) = 0$. These are close to the true values at time 1.2. Note that the
point $t_1$, which is the time at which the reachability player switches the action played
at $l_R$, is contained in the interval $[1.1, 1.2]$. Applying Equation (4) with these values
allows us to show that the maximising action at $l_R$ is $a$, and the minimising action at $l_S$
is also $a$. As a result, we obtain the approximation $p_1(l_R, t - \tau) = 0.0286\tau + 0.107$
and $p_1(l_S, t - \tau) = 0.032\tau + 0.075$.

We now prove error bounds for the approximation $p_1$. Recall that $\mathcal{E}(1, \varepsilon)$ denotes
the difference between $f$ and $p_1$ after $\varepsilon$ time units. We can now formally define this
error, and prove the following bounds.

Lemma 3. If $\varepsilon \leq 1$, then $\mathcal{E}(1, \varepsilon) := \|f^t_{p_1(t)}(t - \varepsilon) - p_1(t - \varepsilon)\| \leq \varepsilon^2$.

The approximation $p_1$ can also be used to construct strategies for the two players
with similar error bounds. We will describe the construction for the reachability player.
The construction for the safety player can be derived analogously.

The strategy for the reachability player is to play the action chosen by $p_1$ during the
entire interval $[t - \varepsilon, t]$. We will define a system of differential equations $g_1(l, \tau)$ that
describe the outcome when the reachability player fixes this strategy, and when the safety player
plays an optimal counter strategy. For each location $l$, we define $g_1(l, t) = f^t_{p_1(t)}(l, t)$,
and we define $g_1(l, \tau)$, for each $\tau \in [t - \varepsilon, t]$, as:

$$-\dot{g}_1(l, \tau) = \sum_{l' \in L} Q(l, a^t_l, l') \cdot g_1(l', \tau) \quad \text{if } l \in L_r, \tag{5}$$

$$-\dot{g}_1(l, \tau) = \min_{a \in \Sigma(l)} \sum_{l' \in L} Q(l, a, l') \cdot g_1(l', \tau) \quad \text{if } l \in L_s. \tag{6}$$



We can prove the following bounds for $\mathcal{E}_s(1, \varepsilon)$, which is the difference between $g_1$
and $f^t_{p_1(t)}$ on an interval of length $\varepsilon$.

Lemma 4. We have $\mathcal{E}_s(1, \varepsilon) := \|g_1(t - \varepsilon) - f^t_{p_1(t)}(t - \varepsilon)\| \leq 2 \cdot \varepsilon^2$.

Lemma 3 gives the $\varepsilon$-step error for $p_1$, and we can apply Lemma 2 to show that
the global error is bounded by $\varepsilon^2 \cdot \frac{T}{\varepsilon} = \varepsilon T$. If $\pi$ is the required precision, then we
can choose $\varepsilon = \frac{\pi}{T}$ to produce an algorithm that terminates after $\frac{T}{\varepsilon} \approx \frac{T^2}{\pi}$ many steps.
Hence, we obtain the following known result.

Theorem 1. For a normed Markov game $\mathcal{M}$ of size $|\mathcal{M}|$, we can compute a $\pi$-optimal
strategy and determine the quality of $\mathcal{M}$ up to precision $\pi$ in time $O(|\mathcal{M}| \cdot T \cdot \frac{T}{\pi})$.
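The single $\varepsilon$-net iteration can be sketched in a few lines. This is an illustrative sketch rather than a reference implementation: the dictionary encoding of the rate matrix, the function name `single_eps_net`, and the two-location example game are assumptions made for the illustration. It uses the form of Equation (2) to evaluate each action and takes `n_steps` intervals of length `T / n_steps`.

```python
import math


def single_eps_net(R, reach_locs, goal, T, n_steps):
    """Approximate f(., 0) for a normed game by one linear step per interval.

    R[l][a] maps successor locations to rates R(l, a, l') (hypothetical encoding).
    reach_locs holds the locations of the reachability player; all other
    locations are treated as safety player locations.
    """
    eps = T / n_steps
    # Values at the time bound T: 1 on goal locations, 0 elsewhere.
    p = {l: (1.0 if l in goal else 0.0) for l in R}
    for _ in range(n_steps):
        slope = {}
        for l, actions in R.items():
            opt = max if l in reach_locs else min
            # Quality of each enabled action at the current end of the interval.
            slope[l] = opt(
                sum(rate * (p[l2] - p[l]) for l2, rate in succ.items())
                for succ in actions.values()
            )
        # Linear extension over one interval of length eps.
        p = {l: p[l] + eps * slope[l] for l in R}
    return p


# Hypothetical normed CTMDP: from "s", action "a" moves to G with rate 1,
# action "b" moves to G with rate 0.5 and loops with rate 0.5.
R_example = {
    "s": {"a": {"G": 1.0}, "b": {"G": 0.5, "s": 0.5}},
    "G": {"stay": {"G": 1.0}},
}
p0 = single_eps_net(R_example, {"s", "G"}, {"G"}, T=1.0, n_steps=1000)
```

In this toy model action "a" is always optimal, so the exact value at "s" is $1 - e^{-T}$, and the computed value stays within the global bound $\varepsilon T$ stated above.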

3.3 Double $\varepsilon$-Nets

In this section we show that only a small amount of additional computation effort needs
to be expended in order to dramatically improve over the precision obtained by single
$\varepsilon$-nets. This will allow us to use much larger values of $\varepsilon$ while still retaining our desired
precision.

In single $\varepsilon$-nets, we computed the gradient of $f$ at the start of each interval and
assumed that the gradient remained constant for the duration of that interval. This gave
us the approximation $p_1$. The key idea behind double $\varepsilon$-nets is that we can use the
approximation $p_1$ to approximate the gradient of $f$ throughout the interval.

We define the approximation $p_2$ as follows: we have $p_2(l, T) = 1$ if $l \in G$ and $0$
otherwise, and if $p_2(l, \tau)$ is defined for every $l \in L$ and every $\tau \in [t, T]$, then we define
$p_2(l, \tau)$ for every $\tau \in [t - \varepsilon, t]$ as:

$$-\dot{p}_2(l, \tau) = \operatorname*{opt}_{a \in \Sigma(l)} \sum_{l' \in L} R(l, a, l') \cdot \big(p_1(l', \tau) - p_1(l, \tau)\big) \quad \forall l \in L. \tag{7}$$

By comparing Equations (2) and (7), we can see that double $\varepsilon$-nets uses $p_1$ as an
approximation for $f$ during the interval $[t - \varepsilon, t]$. Furthermore, in contrast to $p_1$, note that
the approximation $p_2$ can change its choice of optimal action during the interval. The
ability to change the choice of action during an interval is the key property that allows
us to prove stronger error bounds than previous work.

Lemma 5. If $\varepsilon \leq 1$ then $\mathcal{E}(2, \varepsilon) := \|p_2(\tau) - f^t_{p_2(t)}(\tau)\| \leq \frac{2}{3}\varepsilon^3$.

Let us apply the approximation $p_2$ to the example shown in Figure 1. We will again
use the interval $[1.1, 1.2]$, and we will use initial values that were used when we applied
single $\varepsilon$-nets to the example in Section 3.2. We will focus on the location $l_R$. From the
previous section, we know that $p_1(l_R, t - \tau) = 0.0286\tau + 0.107$, and for the actions $a$
and $b$ we have:

• $\sum_{l' \in L} R(l_R, b, l') \, p_1(l', t - \tau) = \frac{1}{5} p_1(l, t - \tau) + \frac{4}{5} p_1(l_R, t - \tau)$.

Fig. 2. This figure shows how $-\dot{p}_2$ is computed on the interval $[1.1, 1.2]$ for the location
$l_R$. The function is given by the upper envelope of the two functions: it agrees with
the quality of $a$ on the interval $[1.2 - z, 1.2]$ and with the quality of $b$ on the interval
$[1.1, 1.2 - z]$.

These functions are shown in Figure 2. To obtain the approximation $p_2$, we must take
the maximum of these two functions. Since $p_1$ is a linear function, we know that these
two functions have exactly one crossing point, and it can be determined that this point
occurs when $p_1(l, t - \tau) = 0.25$, which happens at $\tau = z := \frac{5}{63}$. Since $z \leq 0.1 = \varepsilon$,
we know that the lines intersect within the interval $[1.1, 1.2]$. Consequently, we get the
following piecewise quadratic function for $p_2$:

• When $0 \leq \tau \leq z$, we use the action $a$ and obtain $-\dot{p}_2(l_R, t - \tau) = -0.00572\tau + 0.0286$,
  which implies that $p_2(l_R, t - \tau) = -0.00286\tau^2 + 0.0286\tau + 0.107$.

• When $z < \tau \leq 0.1$ we use action $b$ and obtain $-\dot{p}_2(l_R, t - \tau) = 0.0094\tau + 0.0274$,
  which implies that $p_2(l_R, t - \tau) = 0.0047\tau^2 + 0.0274\tau + 0.107047619$.
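The numbers in this example can be reproduced by intersecting the two action qualities and integrating their upper envelope. The sketch below copies the line coefficients from the text; the names `qa`, `qb`, `integrate_line` and `p2` are hypothetical helpers introduced only for this check.

```python
# Qualities of the two actions at l_R on [1.1, 1.2] as functions of tau,
# stored as (slope, intercept); coefficients copied from the worked example.
qa = (-0.00572, 0.0286)   # action a
qb = (0.0094, 0.0274)     # action b
p_start = 0.107           # value at l_R at time t = 1.2

# Crossing point of the two lines: qa(z) == qb(z).
z = (qa[1] - qb[1]) / (qb[0] - qa[0])


def integrate_line(line, lo, hi):
    """Integral of slope * tau + intercept over [lo, hi]."""
    m, c = line
    return 0.5 * m * (hi ** 2 - lo ** 2) + c * (hi - lo)


def p2(tau):
    """p_2(l_R, t - tau): integrate the upper envelope (a up to z, then b)."""
    if tau <= z:
        return p_start + integrate_line(qa, 0.0, tau)
    return p_start + integrate_line(qa, 0.0, z) + integrate_line(qb, z, tau)
```

Evaluating `z` gives the switching point $5/63$, and `p2` agrees with the two quadratic pieces listed above, including the constant of the second piece.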

As with single $\varepsilon$-nets, we can provide a strategy that obtains similar error bounds.
Once again, we will consider only the reachability player, because the proof can easily
be generalised for the safety player. In much the same way as we did for $g_1$, we will
define a system of differential equations $g_2(l, \tau)$ that describe the outcome when the
reachability player plays according to $p_2$, and the safety player plays an optimal counter
strategy. For each location $l$, we define $g_2(l, t) = f^t_{p_2(t)}(l, t)$. If $a^\tau_l$ denotes the action
that maximises Equation (7) at the time point $\tau \in [t - \varepsilon, t]$, then we define $g_2(l, \tau)$, as:

$$-\dot{g}_2(l, \tau) = \sum_{l' \in L} Q(l, a^\tau_l, l') \cdot g_2(l', \tau) \quad \text{if } l \in L_r, \tag{8}$$

$$-\dot{g}_2(l, \tau) = \min_{a \in \Sigma(l)} \sum_{l' \in L} Q(l, a, l') \cdot g_2(l', \tau) \quad \text{if } l \in L_s. \tag{9}$$

The following lemma proves that the difference between $g_2$ and $f^t_{p_2(t)}$ has similar bounds
to those shown in Lemma 5.

Lemma 6. If $\varepsilon \leq 1$ then we have $\mathcal{E}_s(2, \varepsilon) := \|g_2(t - \varepsilon) - f^t_{p_2(t)}(t - \varepsilon)\| \leq 2 \cdot \varepsilon^3$.

Computing the approximation $p_2$ for an interval $[t - \varepsilon, t]$ is not expensive. The fact
that $p_1$ is linear implies that each action can be used for at most one subinterval of
$[t - \varepsilon, t]$. Therefore, there are fewer than $|\Sigma|$ points at which the strategy changes, which
implies that $p_2$ is a piecewise quadratic function with at most $|\Sigma|$ pieces. It is possible
to design an algorithm that uses sorting to compute these switching points, achieving
the following complexity.

Lemma 7. Computing $p_2$ for an interval $[t - \varepsilon, t]$ takes $O(|\mathcal{M}| + |L| \cdot |\Sigma| \cdot \log |\Sigma|)$
time.
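One way to realise the sorting idea for a single maximising location is the standard upper-envelope construction for lines: sort the action qualities by slope and discard lines that never attain the maximum. The sketch below is an assumption about how such a routine could look, not the construction used in the proof of Lemma 7; `cross_x` and `upper_envelope` are hypothetical names, and each action is assumed to be given as a line `(name, slope, intercept)` in $\tau$.

```python
def cross_x(l1, l2):
    """tau at which two lines (name, slope, intercept) with different slopes meet."""
    return (l1[2] - l2[2]) / (l2[1] - l1[1])


def upper_envelope(lines, eps):
    """Pieces (lo, hi, action) of the maximum of the given lines on [0, eps]."""
    # Sort by slope; among equal slopes only the largest intercept can matter.
    dedup = []
    for ln in sorted(lines, key=lambda l: (l[1], l[2])):
        if dedup and dedup[-1][1] == ln[1]:
            dedup[-1] = ln
        else:
            dedup.append(ln)
    # Keep only lines that appear on the upper envelope.
    hull = []
    for ln in dedup:
        while len(hull) >= 2 and cross_x(hull[-2], ln) <= cross_x(hull[-2], hull[-1]):
            hull.pop()
        hull.append(ln)
    # Turn consecutive crossing points into pieces clipped to [0, eps].
    pieces = []
    for i, ln in enumerate(hull):
        lo = cross_x(hull[i - 1], ln) if i > 0 else float("-inf")
        hi = cross_x(ln, hull[i + 1]) if i + 1 < len(hull) else float("inf")
        lo, hi = max(lo, 0.0), min(hi, eps)
        if lo < hi:
            pieces.append((lo, hi, ln[0]))
    return pieces


# The two qualities from the running example plus a hypothetical dominated action c.
example_lines = [("a", -0.00572, 0.0286), ("b", 0.0094, 0.0274), ("c", 0.0, 0.02)]
example_pieces = upper_envelope(example_lines, 0.1)
```

For the example lines the routine returns two pieces, action `a` on $[0, z]$ and action `b` on $[z, 0.1]$ with $z = 5/63$; the dominated action never appears. Sorting dominates the cost, which matches the $|\Sigma| \log |\Sigma|$ term per location; a minimising location can be handled by negating slopes and intercepts.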



Since the $\varepsilon$-step error for double $\varepsilon$-nets is bounded by $\varepsilon^3$, we can apply Lemma 2
to conclude that the global error is bounded by $\varepsilon^3 \cdot \frac{T}{\varepsilon} = \varepsilon^2 T$. Therefore, if we want to
compute $f$ with a precision of $\pi$, we should choose $\varepsilon \approx \sqrt{\pi / T}$, which gives
$\frac{T}{\varepsilon} \approx \frac{T^{1.5}}{\sqrt{\pi}}$ distinct intervals.

Theorem 2. For a normed Markov game $\mathcal{M}$ we can approximate the time-bounded
reachability, construct $\pi$-optimal memoryless strategies for both players, and determine
the quality of these strategies with precision $\pi$ in time
$O\big(|\mathcal{M}| \cdot T \cdot \sqrt{T / \pi} + |L| \cdot T \cdot \sqrt{T / \pi} \cdot |\Sigma| \log |\Sigma|\big)$.



3.4 Triple $\varepsilon$-Nets and Beyond

The techniques used to construct the approximation $p_2$ from the approximation $p_1$ can
be generalised. This is because the only property of $p_1$ that is used in the proof of
Lemma 5 is the fact that it is a piecewise polynomial function that approximates $f$.
Therefore, we can inductively define a sequence of approximations $p_k$ as follows:

$$-\dot{p}_k(l, \tau) = \operatorname*{opt}_{a \in \Sigma(l)} \sum_{l' \in L} R(l, a, l') \cdot \big(p_{k-1}(l', \tau) - p_{k-1}(l, \tau)\big) \tag{10}$$

We can repeat the arguments from the previous sections to obtain the following error
bounds:

Lemma 8. For every $k \geq 2$, if we have $\mathcal{E}(k, \varepsilon) \leq c \cdot \varepsilon^{k+1}$, then we have
$\mathcal{E}(k + 1, \varepsilon) \leq \frac{2}{k+2} \cdot c \cdot \varepsilon^{k+2}$. Moreover, if we additionally have that
$\mathcal{E}_s(k, \varepsilon) \leq d \cdot \varepsilon^{k+1}$, then we also have that $\mathcal{E}_s(k + 1, \varepsilon) \leq \frac{8c + 3d}{k+2} \cdot \varepsilon^{k+2}$.



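Computing the accuracies explicitly for the first four levels of $\varepsilon$-nets gives:

$k$                              1                     2                                3                                  4
$\mathcal{E}(k, \varepsilon)$        $\varepsilon^2$       $\frac{2}{3}\varepsilon^3$       $\frac{1}{3}\varepsilon^4$         $\frac{2}{15}\varepsilon^5$
$\mathcal{E}_s(k, \varepsilon)$      $2\varepsilon^2$      $2\varepsilon^3$                 $\frac{17}{6}\varepsilon^4$        $\frac{67}{30}\varepsilon^5$

We can also compute, for a given precision $\pi$, the value of $\varepsilon$ that should be used in
order to achieve an accuracy of $\pi$ with $\varepsilon$-nets of level $k$.

Lemma 9. To obtain a precision $\pi$ with an $\varepsilon$-net of level $k$, we choose $\varepsilon \approx \sqrt[k]{\pi / T}$,
resulting in $\frac{T}{\varepsilon} \approx T \sqrt[k]{T / \pi}$ steps.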
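The interval counts quoted in Table 1 follow from these step-error bounds. As a sketch, assume the step error at level $k$ is at most $c_k \varepsilon^{k+1}$ with the constants $1$, $\frac{2}{3}$, $\frac{1}{3}$, $\frac{2}{15}$ from Lemmas 3, 5 and 8, so that the summed error over $T / \varepsilon$ intervals is $c_k \varepsilon^k T$. The names `STEP_CONST` and `num_intervals` are hypothetical.

```python
import math

# Step-error constants c_k with E(k, eps) <= c_k * eps**(k + 1).
STEP_CONST = {1: 1.0, 2: 2.0 / 3.0, 3: 1.0 / 3.0, 4: 2.0 / 15.0}


def num_intervals(k, T, pi):
    """Number of intervals T / eps when eps solves c_k * eps**k * T = pi."""
    eps = (pi / (STEP_CONST[k] * T)) ** (1.0 / k)
    return math.ceil(T / eps)


# Worst-case interval counts for T = 10 and precision 1e-7.
counts = {k: num_intervals(k, 10.0, 1e-7) for k in (2, 3, 4)}
```

With $T = 10$ this reproduces the double, triple and quadruple rows of Table 1, for example 81,650, 3,219 and 605 intervals for $\pi = 10^{-7}$. The constants only shift the counts by a fixed factor relative to the asymptotic estimate in Lemma 9.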

Unfortunately, the cost of computing $\varepsilon$-nets of level $k$ becomes increasingly
prohibitive as $k$ increases. To see why, we first give a property of the functions $p_k$. Recall
that $p_2$ is a piecewise quadratic function. It is not too difficult to see how this generalises
to the approximations $p_k$.

Lemma 10. The approximation $p_k$ is piecewise polynomial with degree less than or
equal to $k$.

Although these functions are well-behaved in the sense that they are always
piecewise polynomial, the number of pieces can grow exponentially in the worst case. The
following lemma describes this bound.

Lemma 11. If $p_{k-1}$ has $c$ pieces in the interval $[t - \varepsilon, t]$, then $p_k$ has at most
$\frac{1}{2} \cdot c \cdot k \cdot |L| \cdot |\Sigma|^2$ pieces in the interval $[t - \varepsilon, t]$.

The upper bound given above is quite coarse, and we would be surprised if it were
found to be tight. Moreover, we do not believe that the number of pieces will grow
anywhere close to this bound in practice. This is because it is rare, in our experience,
for optimal strategies to change their decision many times within a small time interval.

However, there is a more significant issue that makes $\varepsilon$-nets become impractical
as $k$ increases. In order to compute the approximation $p_k$, we must be able to compute
the roots of polynomials with degree $k - 1$. Since we can only efficiently compute the
roots of quadratic functions, and efficiently approximate the roots of cubic functions,
only the approximations $p_3$ and $p_4$ are realistically useful.

Once again it is possible to provide a smart algorithm that uses sorting in order to
find the switching points in the functions $p_3$ and $p_4$, which gives the following bounds
on the cost of computing them.

Theorem 3. For a normed Markov game $\mathcal{M}$ we can construct $\pi$-optimal memoryless
strategies for both players and determine the quality of these strategies with precision
$\pi$ in time $O\big(|L|^2 \cdot \sqrt[3]{T / \pi} \cdot T \cdot |\Sigma|^4 \log |\Sigma|\big)$ when using triple $\varepsilon$-nets, and in time
$O\big(|L|^3 \cdot \sqrt[4]{T / \pi} \cdot T \cdot |\Sigma|^6 \log |\Sigma|\big)$ when using quadruple $\varepsilon$-nets.

It is not clear if triple and quadruple $\varepsilon$-nets will only be of theoretical interest, or if
they will be useful in practice. It should be noted that the worst case complexity bounds
given by Theorem 3 arise from the upper bound on the number of switching points
given in Lemma 11. Thus, if the number of switching points that occur in practical
examples is small, these techniques may become more attractive. Our experiments in
the following section give some evidence that this may be true.

4 Experimental Results and Conclusion 

In order to test the practicability of our algorithms, we have implemented both double
and triple $\varepsilon$-nets. We evaluated these algorithms on two sets of examples. Firstly, we
tested our algorithms on the Erlang-example (see Figure 3) presented in [5] and [18].
We chose to consider the same parameters used by those papers: we consider maximal
probability to reach location $l_4$ from within 7 time units. Since this example is a
CTMDP, we were able to compare our results with the Markov Reward Model Checker
(MRMC) [5] implementation, which includes an implementation of the techniques
proposed by Buckholz and Schulz.




Fig. 3. A CTMDP offering the choice between a long chain of fast transitions and a
slower path that loses some probability mass in

We also tested our algorithms on continuous-time Markov games, where we used
the model depicted in Figure 4, consisting of two chains of locations $l_1, l_2, \ldots, l_{100}$ and
$l'_1, l'_2, \ldots, l'_{100}$ that are controlled by the maximising player and the minimising player,
respectively. This example is designed to produce a large number of switching points. In
every location $l_i$ of the maximising player, there is the choice between the short but slow
route along the chain of maximising locations, and the slightly longer route which uses
the minimising player's locations. If very little time remains, the maximising player
prefers to take the slower actions, as fewer transitions are required to reach the goal
using these actions. The maximiser also prefers these actions when a large amount of
time remains. However, between these two extremes, there is a time interval in which
it is advantageous for the maximising player to take the action with rate 3. A similar
situation occurs for the minimising player, and this leads to a large number of points
where the players change their strategy.

The results of our experiments are shown in Table 2. The MRMC implementation
was unable to provide results for precisions beyond $1.86 \cdot 10^{-9}$. For the Erlang examples
we found that, as the desired precision increases, our algorithms draw further ahead
of the current techniques. The most interesting outcome of these experiments is the
validation of triple $\varepsilon$-nets for practical use. While the worst case theoretical bounds
arising from Lemma 11 indicated that the cost of computing the approximation for
each interval may become prohibitive, these results show that the worst case does not
seem to play a role in practice. In fact, we found that the number of switching points
summed over all intervals and locations never exceeded 2 in this example.

Our results on Markov games demonstrate that our algorithms are capable of solving
non-trivially sized games in practice. Once again we find that triple $\varepsilon$-nets provide a
substantial performance increase over double $\varepsilon$-nets, and that the worst case bounds
given by Lemma 11 do not seem to occur. Double $\varepsilon$-nets found 297 points where the
strategy changed during an interval, and triple $\varepsilon$-nets found 684 such points. Hence, the
$|L| \cdot |\Sigma|^2$ factor given in Lemma 11 does not seem to arise here.

Fig. 4. A CTMG with many switching points.

Table 2. Experimental evaluation of our algorithms.

A Proof of Lemma 1



We first show how our algorithms can be used to solve uniform Markov games, and then
argue that this is sufficient to solve general Markov games. In order to solve uniform
Markov games with arbitrary uniformisation rate $\lambda$, we will define a corresponding
normed Markov game in which time has been compressed by a factor of $\lambda$. More
precisely, for each Markov game $\mathcal{M} = (L, L_r, L_s, \Sigma, \mathbf{R}, \mathbf{P}, \nu)$ with uniform transition
rate $\lambda > 0$, we define $\mathcal{M}^{\|\cdot\|} = (L, L_r, L_s, \Sigma, \mathbf{P}, \mathbf{P}, \nu)$, which is the Markov game that
differs from $\mathcal{M}$ only in the rate matrix. In particular, we replace $\mathbf{R}$ with $\mathbf{P} = \frac{1}{\lambda}\mathbf{R}$. The
following lemma allows us to translate solutions of $\mathcal{M}^{\|\cdot\|}$ to $\mathcal{M}$.

Lemma 12. For every uniform Markov game $\mathcal{M}$, if we have approximated the optimal
time-bounded reachability probabilities and strategies in $\mathcal{M}^{\|\cdot\|}$ with some precision $\pi$
for the time bound $T$, then we can approximate optimal time-bounded reachability
probabilities and strategies in $\mathcal{M}$ with precision $\pi$ for the time bound $\lambda T$.

Proof: To prove this claim, we define the bijection $b : S[\mathcal{M}^{\|\cdot\|}] \to S[\mathcal{M}]$ between
schedulers of $\mathcal{M}^{\|\cdot\|}$ and $\mathcal{M}$ that maps each scheduler $s \in S[\mathcal{M}^{\|\cdot\|}, T]$ to a scheduler
$s' \in S[\mathcal{M}, \lambda T]$ with $s'(l, t) = s(l, \lambda t)$ for all $t \in [0, T]$. In other words, we map each
scheduler of $\mathcal{M}^{\|\cdot\|}$ to a scheduler of $\mathcal{M}$ in which time has been stretched by a factor
of $\lambda$. It is not too difficult to see that the time-bounded reachability probability for
time bound $\lambda T$ in $\mathcal{M}$ under $s' = b(s)$ is equivalent to the time-bounded reachability
probability for time bound $T$ in $\mathcal{M}^{\|\cdot\|}$ under $s$. This bijection therefore proves that
the optimal time-bounded reachability probabilities are the same in both games, and it
also provides a procedure for translating approximately optimal strategies of the game
$\mathcal{M}^{\|\cdot\|}$ to the game $\mathcal{M}$. Since the optimal reachability probabilities are the same in both
games, an approximation of the optimal reachability probability in $\mathcal{M}^{\|\cdot\|}$ with
precision $\pi$ must also be an approximation of the optimal reachability probability in $\mathcal{M}$
with precision $\pi$. □
In order to solve general Markov games we can first uniformise them, and then
apply Lemma 12. If $\mathcal{M} = (L, L_r, L_s, \Sigma, \mathbf{R}, \mathbf{P}, \nu)$ is a continuous-time Markov game,
then we define the uniformisation of $\mathcal{M}$ as $\mathrm{unif}(\mathcal{M}) = (L, L_r, L_s, \Sigma, \mathbf{R}', \mathbf{P}, \nu)$, where
$\mathbf{R}'$ is defined as follows. If $\lambda = \max_{l \in L} \max_{a \in \Sigma(l)} \mathbf{R}(l, a, L \setminus \{l\})$, then we define,
for every pair of locations $l, l' \in L$, and every action $a \in \Sigma(l)$:

$$\mathbf{R}'(l, a, l') = \begin{cases} \mathbf{R}(l, a, l') & \text{if } l \neq l', \\ \lambda - \mathbf{R}(l, a, L \setminus \{l\}) & \text{if } l = l'. \end{cases}$$



Previous work has noted that, for the class of late schedulers, the optimal time- 
bounded reachability probabilities and schedulers in M are identical to the optimal 
time-bounded reachability probabilities and schedulers in unif(A / () ifTTl . To see why, 
note that the differential equations that define the optimal reachability probabilities do
not refer to the entry $\mathbf{R}'(l, a, l)$, and therefore the modifications made to the rate
matrix by uniformisation can have no effect on the choice of optimal action.

Lemma 13. HI 7V For every continuous-time Markov game Ai, the optimal time- 
bounded reachability probabilities and schedulers of Ai are identical to the optimal 
time-bounded reachability probabilities and schedulers o/unif (Ai). 
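
The two transformations used in this appendix, uniformisation followed by division by the uniformisation rate, can be illustrated with a minimal sketch. The nested-dictionary encoding of the rate matrix, the function names, and the three-location example are assumptions made purely for illustration; they do not come from the paper or from any existing tool.

```python
from fractions import Fraction

def exit_rate(R, l, a):
    """Total rate of leaving location l under action a, i.e. R(l, a, L \\ {l})."""
    return sum(rate for target, rate in R[l][a].items() if target != l)

def uniformise(R):
    """Return (lam, R_unif), where self-loops give every (l, a) the total rate lam."""
    lam = max(exit_rate(R, l, a) for l in R for a in R[l])
    R_unif = {}
    for l in R:
        R_unif[l] = {}
        for a in R[l]:
            # Keep all rates to other locations unchanged.
            row = {t: r for t, r in R[l][a].items() if t != l}
            # The self-loop entry absorbs the missing rate.
            loop = lam - exit_rate(R, l, a)
            if loop > 0:
                row[l] = loop
            R_unif[l][a] = row
    return lam, R_unif

def normalise(R_unif, lam):
    """Divide every rate by lam, giving the rate matrix of the normed game."""
    return {l: {a: {t: r / lam for t, r in row.items()} for a, row in acts.items()}
            for l, acts in R_unif.items()}

# Hypothetical example: R[location][action][target] = rate.
R = {
    "l0": {"a": {"l1": Fraction(3)},
           "b": {"l1": Fraction(1), "goal": Fraction(1)}},
    "l1": {"a": {"goal": Fraction(2)}},
    "goal": {"a": {}},
}
lam, R_unif = uniformise(R)
P = normalise(R_unif, lam)
```

After these two steps every action has total rate 1, which is the property the proofs in the following appendices rely on when they refer to normed Markov games.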




B Proof of Lemma 2



Proof: In order to prove this lemma, we will show that $\|f(t-\varepsilon) - f_{p(t)}(t-\varepsilon)\| \le \mu$.
This implies the claimed result, because by assumption we have
$\|f_{p(t)}(t-\varepsilon) - p(t-\varepsilon)\| \le \nu$, and therefore the triangle inequality implies that
$\|f(t-\varepsilon) - p(t-\varepsilon)\| \le \mu + \nu$.

We first prove that $\|f(t-\tau) - f_{f(t)+c}(t-\tau)\| = c$ for every constant $c$ and every
$\tau \in [0, \varepsilon]$. In other words, by increasing the values of each location by $c$ at time $t$, we
increase the values given by $f$ by $c$ on the interval $[t-\varepsilon, t]$. To see this, note that by
definition we have $\sum_{l' \in L} \mathbf{Q}(l, a, l') = 0$ for every action $a$, and therefore if we have
$f'(l, t-\tau) = f(l, t-\tau) + c$ for some time $\tau \in [0, \varepsilon]$, and every location $l$, then we
have:

$$\sum_{l' \in L} \mathbf{Q}(l, a, l') f'(l', t-\tau) = \sum_{l' \in L} \mathbf{Q}(l, a, l') (f(l', t-\tau) + c) = \sum_{l' \in L} \mathbf{Q}(l, a, l') f(l', t-\tau).$$

We can then use this equality and the differential equations that define $f$ to conclude that
$-\dot{f}_{f(t)+c}(l, t-\tau) = -\dot{f}(l, t-\tau)$ for every $\tau \in [0, \varepsilon]$, and therefore, we have
$f_{f(t)+c}(t-\tau) - f(t-\tau) = c$ for every $\tau \in [0, \varepsilon]$. Since $\|f(t) - p(t)\| \le \mu$, it follows
that $\|f(t-\varepsilon) - f_{p(t)}(t-\varepsilon)\| \le \mu$ as required. □

C Proof of Lemma 3

Lemma 14. If $\varepsilon \le 1$, then we have $p_1(l, t) \in [0, 1]$ for all $t \in [0, T]$.

Proof: We will prove this by induction over the intervals $[t-\varepsilon, t]$. The base case is
trivial since we have by definition that either $p_1(l, T) = 0$ or $p_1(l, T) = 1$. Now suppose
that $p_1(l, t) \in [0, 1]$ for some $\varepsilon$-interval $[t-\varepsilon, t]$. We will prove that $p_1(l, t-\tau) \in [0, 1]$
for all $\tau \in [0, \varepsilon]$.

From the definition of $p_1$, we know that $-\dot{p}_1(l, t-\tau) = c_l^t = \sum_{l' \in L} \mathbf{R}(l, a_l^t, l') \cdot (p_1(l', t) - p_1(l, t))$
for all $\tau \in [0, \varepsilon]$. Therefore, since $\tau \le \varepsilon \le 1$ we have:



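$$\begin{aligned}
p_1(l, t-\tau) &= p_1(l, t) + \tau \cdot \sum_{l' \in L} \mathbf{R}(l, a_l^t, l') \cdot (p_1(l', t) - p_1(l, t)) \\
&\le p_1(l, t) + \sum_{l' \in L} \mathbf{R}(l, a_l^t, l') \cdot (p_1(l', t) - p_1(l, t)) \\
&= \Bigl(1 - \sum_{l' \neq l} \mathbf{R}(l, a_l^t, l')\Bigr) \cdot p_1(l, t) + \sum_{l' \neq l} \mathbf{R}(l, a_l^t, l') \cdot p_1(l', t).
\end{aligned}$$

Since we are considering normed Markov games, we have that $\sum_{l' \neq l} \mathbf{R}(l, a, l') \le 1$,
and therefore $p_1(l, t-\tau)$ is a weighted average over the values $p_1(l', t)$ where $l' \in L$.
From the inductive hypothesis, we have that $p_1(l', t) \in [0, 1]$ for every $l' \in L$, and
therefore a weighted average over these values must also lie in $[0, 1]$. □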
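
The step analysed in Lemma 14 can be sketched in a few lines. The code below is only an illustration under stated assumptions: the dictionary encoding, the function name `single_net_step`, and the small normed example game (with an absorbing goal location) are hypothetical and are not taken from the paper's implementation.

```python
from fractions import Fraction

def single_net_step(R, reach_locs, p_t, tau):
    """Return p_1(., t - tau) from p_1(., t) for one single epsilon-net step.

    R[l][a][l2] holds the rates of a normed game (each row sums to 1), p_t maps
    locations to their values at time t, and 0 <= tau <= epsilon <= 1.
    """
    p_new = {}
    for l, actions in R.items():
        # c_l^t = opt_a sum_{l'} R(l, a, l') * (p_1(l', t) - p_1(l, t))
        slopes = [sum(r * (p_t[l2] - p_t[l]) for l2, r in row.items())
                  for row in actions.values()]
        c = max(slopes) if l in reach_locs else min(slopes)
        # p_1 is linear on the interval, with slope c_l^t.
        p_new[l] = p_t[l] + tau * c
    return p_new

# Hypothetical normed CTMDP in which every location belongs to the reachability player.
half = Fraction(1, 2)
R = {
    "l0": {"fast": {"l1": Fraction(1)}, "slow": {"goal": half, "l0": half}},
    "l1": {"go": {"goal": Fraction(1)}},
    "goal": {"stay": {"goal": Fraction(1)}},
}
p = {"l0": Fraction(0), "l1": Fraction(0), "goal": Fraction(1)}  # values at time T
eps = Fraction(1, 10)
history = [p]
for _ in range(10):
    p = single_net_step(R, {"l0", "l1", "goal"}, p, eps)
    history.append(p)
```

In this sketch every value stays in $[0, 1]$ after each step, as the lemma predicts for $\varepsilon \le 1$.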

Lemma 15. If $\varepsilon \le 1$ then we have $-\dot{f}_{p_1(t)}(l, t-\tau) \in [-1, 1]$ for every $\tau \in [0, \varepsilon]$.

Proof: Lemma 14 implies that $f_{p_1(t)}(l, t) = p_1(l, t) \in [0, 1]$ for all $l \in L$. Since the
system of differential equations that defines $f_{p_1(t)}$ gives optimal reachability probabilities
under the assumption that $f_{p_1(t)}(l, t) = p_1(l, t)$, we must have that $f_{p_1(t)}(l, t-\tau) \in [0, 1]$
for all $\tau \in [0, \varepsilon]$.

We first prove that $-\dot{f}_{p_1(t)}(l, t-\tau) \le 1$. We will prove this for the reachability
player; the proof for the safety player is analogous. By definition we have:

$$-\dot{f}_{p_1(t)}(l, t-\tau) = \max_{a \in \Sigma(l)} \sum_{l' \in L} \mathbf{R}(l, a, l') (f_{p_1(t)}(l', t-\tau) - f_{p_1(t)}(l, t-\tau)).$$

Since we have shown that $f_{p_1(t)}(l, t-\tau) \in [0, 1]$ for all $l$, and we have
$\sum_{l' \in L} \mathbf{R}(l, a, l') = 1$ for every action $a$ in a normed Markov game, we obtain:

$$\max_{a \in \Sigma(l)} \sum_{l' \in L} \mathbf{R}(l, a, l') (f_{p_1(t)}(l', t-\tau) - f_{p_1(t)}(l, t-\tau)) \le \max_{a \in \Sigma(l)} \sum_{l' \in L} \mathbf{R}(l, a, l') (1 - 0) = 1.$$

To prove that $-\dot{f}_{p_1(t)}(l, t-\tau) \ge -1$ we use a similar argument:

$$\max_{a \in \Sigma(l)} \sum_{l' \in L} \mathbf{R}(l, a, l') (f_{p_1(t)}(l', t-\tau) - f_{p_1(t)}(l, t-\tau)) \ge \max_{a \in \Sigma(l)} \sum_{l' \in L} \mathbf{R}(l, a, l') (0 - 1) = -1.$$

Therefore we have $-\dot{f}_{p_1(t)}(l, t-\tau) \in [-1, 1]$. □

We can now provide a proof for Lemma 3.
Proof: Lemma 15 implies that $-\dot{f}_{p_1(t)}(l, t-\tau) \in [-1, 1]$ for every $\tau \in [0, \varepsilon]$. Since
the rate of change of $f_{p_1(t)}$ is in the range $[-1, 1]$, we know that $f_{p_1(t)}$ can change by at
most $\tau$ in the interval $[t-\tau, t]$. We also know that $f_{p_1(t)}(l, t) = p_1(l, t)$, and therefore
we must have the following property:

$$\|f_{p_1(t)}(l, t-\tau) - p_1(l, t)\| \le \tau. \tag{11}$$

The key step in this proof is to show that $\|\dot{f}_{p_1(t)}(l, t-\tau) - \dot{p}_1(l, t-\tau)\| \le 2 \cdot \tau$ for
all $\tau \in [0, \varepsilon]$. Note that by definition we have $\dot{p}_1(l, t-\tau) = \dot{p}_1(l, t)$ for all $\tau \in [0, \varepsilon]$,
and so it suffices to prove that $\|\dot{f}_{p_1(t)}(l, t-\tau) - \dot{p}_1(l, t)\| \le 2 \cdot \tau$.

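Suppose that $l$ is a location for the reachability player, let $a_l^t$ be the optimal action
at time $t$, and let $a_l^{t-\tau}$ be the optimal action at $t-\tau$. We have the following:

$$\begin{aligned}
-\dot{p}_1(l, t) - 2 \cdot \tau &= \sum_{l' \in L} \mathbf{R}(l, a_l^t, l') (p_1(l', t) - p_1(l, t)) - 2 \cdot \tau \\
&\le \sum_{l' \in L} \mathbf{R}(l, a_l^t, l') (f_{p_1(t)}(l', t-\tau) - f_{p_1(t)}(l, t-\tau)) \\
&\le \sum_{l' \in L} \mathbf{R}(l, a_l^{t-\tau}, l') (f_{p_1(t)}(l', t-\tau) - f_{p_1(t)}(l, t-\tau)) = -\dot{f}_{p_1(t)}(l, t-\tau).
\end{aligned}$$

The first equality is the definition of $-\dot{p}_1(l, t)$. The first inequality follows from
Equation (11) and the fact that $\sum_{l' \in L} \mathbf{R}(l, a, l') = 1$. The second inequality follows from
the fact that $a_l^{t-\tau}$ is an optimal action at time $t-\tau$, and the final equality is the definition of
$-\dot{f}_{p_1(t)}(l, t-\tau)$. Using the same techniques in a different order gives:

$$\begin{aligned}
-\dot{f}_{p_1(t)}(l, t-\tau) &= \sum_{l' \in L} \mathbf{R}(l, a_l^{t-\tau}, l') (f_{p_1(t)}(l', t-\tau) - f_{p_1(t)}(l, t-\tau)) \\
&\le \sum_{l' \in L} \mathbf{R}(l, a_l^{t-\tau}, l') (p_1(l', t) - p_1(l, t)) + 2 \cdot \tau \\
&\le \sum_{l' \in L} \mathbf{R}(l, a_l^t, l') (p_1(l', t) - p_1(l, t)) + 2 \cdot \tau = -\dot{p}_1(l, t) + 2 \cdot \tau.
\end{aligned}$$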

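To prove the claim for a location $l$ belonging to the safety player, we use the same
arguments, but in reverse order. That is, we have:

$$\begin{aligned}
-\dot{p}_1(l, t) - 2 \cdot \tau &= \sum_{l' \in L} \mathbf{R}(l, a_l^t, l') (p_1(l', t) - p_1(l, t)) - 2 \cdot \tau \\
&\le \sum_{l' \in L} \mathbf{R}(l, a_l^{t-\tau}, l') (p_1(l', t) - p_1(l, t)) - 2 \cdot \tau \\
&\le \sum_{l' \in L} \mathbf{R}(l, a_l^{t-\tau}, l') (f_{p_1(t)}(l', t-\tau) - f_{p_1(t)}(l, t-\tau)) = -\dot{f}_{p_1(t)}(l, t-\tau).
\end{aligned}$$

We also have:

$$\begin{aligned}
-\dot{f}_{p_1(t)}(l, t-\tau) &= \sum_{l' \in L} \mathbf{R}(l, a_l^{t-\tau}, l') (f_{p_1(t)}(l', t-\tau) - f_{p_1(t)}(l, t-\tau)) \\
&\le \sum_{l' \in L} \mathbf{R}(l, a_l^t, l') (f_{p_1(t)}(l', t-\tau) - f_{p_1(t)}(l, t-\tau)) \\
&\le \sum_{l' \in L} \mathbf{R}(l, a_l^t, l') (p_1(l', t) - p_1(l, t)) + 2 \cdot \tau = -\dot{p}_1(l, t) + 2 \cdot \tau.
\end{aligned}$$

Therefore, we have shown that $\|\dot{f}_{p_1(t)}(l, t-\tau) - \dot{p}_1(l, t-\tau)\| \le 2 \cdot \tau$ for all $\tau \in [0, \varepsilon]$
and for every $l \in L$.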

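We now complete the proof by arguing that $\|f_{p_1(t)}(t-\tau) - p_1(t-\tau)\| \le \tau^2$. Let
$d(l, t-\tau) := \|f_{p_1(t)}(l, t-\tau) - p_1(l, t-\tau)\|$ for every $\tau \in [0, \varepsilon]$. Our arguments so
far imply:

$$\dot{d}(l, t-\tau) \le \|\dot{f}_{p_1(t)}(t-\tau) - \dot{p}_1(t-\tau)\| \le 2 \cdot \tau.$$

Therefore, we have:

$$d(l, t-\tau) \le \int_0^{\tau} 2 \tau' \, \mathrm{d}\tau' = \tau^2.$$

This allows us to conclude that $\mathcal{E}(1, \varepsilon) := \|f_{p_1(t)}(t-\varepsilon) - p_1(t-\varepsilon)\| \le \varepsilon^2$. □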

D Proof of Lemma 4

We begin by proving the following auxiliary lemma, which shows that the difference
between $p_1$ and $g_1$ is bounded by $\varepsilon^2$.

Lemma 16. We have $\|g_1(t-\varepsilon) - p_1(t-\varepsilon)\| \le \varepsilon^2$.

Proof: Suppose that we apply single $\varepsilon$-nets to approximate the solution of the system
of differential equations $g_1$ over the interval $[t-\varepsilon, t]$ to obtain an approximation $p_1^g$. To
do this, we select for each location $l \in L_s$ an action $a$ that satisfies:

$$a \in \operatorname{argopt}_{a \in \Sigma(l)} \sum_{l' \in L} \mathbf{Q}(l, a, l') \cdot g_1(l', t).$$

Since $g_1(l, t) = f_{p_1(t)}(l, t) = p_1(l, t)$ for every location $l$, we have that $a = a_l^t$, where
$a_l^t$ is the action chosen by $p_1$ at $l$. In other words, the approximations $p_1^g$ and $p_1$ choose
the same actions for every location in $L_s$. Therefore, for all locations $l \in L$, we have
$c_l^t = \sum_{l' \in L} \mathbf{Q}(l, a_l^t, l') \cdot p_1^g(l', t) = \sum_{l' \in L} \mathbf{Q}(l, a_l^t, l') \cdot p_1(l', t)$, which implies that for
every time $\tau \in [0, \varepsilon]$ we have:

$$p_1^g(l, t-\tau) = p_1^g(l, t) + \tau \cdot c_l^t = p_1(l, t) + \tau \cdot c_l^t = p_1(l, t-\tau).$$

That is, the approximations $p_1^g$ and $p_1$ are identical.

Note that the system of differential equations $g_1$ describes a continuous-time
Markov game in which some actions for the reachability player have been removed.
Since $g_1$ describes a CTMG, we can apply Lemma 3 to obtain $\|g_1(t-\tau) - p_1^g(t-\tau)\| \le \varepsilon^2$.
Since $p_1^g(t-\varepsilon) = p_1(t-\varepsilon)$, we can conclude that $\|g_1(t-\tau) - p_1(t-\tau)\| \le \varepsilon^2$. □

Lemma 4 now follows from Lemma 16 and Lemma 3.

E Proof of Theorem 1

Proof: As we have argued in the main text, in order to guarantee a precision of $\pi$, it
suffices to choose $\varepsilon = \frac{\pi}{T}$, which gives $\frac{T^2}{\pi}$ many intervals $[t-\varepsilon, t]$ for which $p_1$ must
be computed. It is clear that, for each interval, the approximation $p_1$ can be computed
in $O(|\mathcal{M}|)$ time, and therefore, the total running time will be $O(|\mathcal{M}| \cdot T \cdot \frac{T}{\pi})$. □

F Proof of Lemma 5

Proof: We begin by considering the system of differential equations that define $p_2$:

$$-\dot{p}_2(l, \tau) = \operatorname{opt}_{a \in \Sigma(l)} \sum_{l' \in L} \mathbf{R}(l, a, l') \cdot (p_1(l', \tau) - p_1(l, \tau)) \quad \forall l \in L.$$

The error bounds given by Lemma 3 imply that $\|p_1(t-\tau) - f_{p_2(t)}(t-\tau)\| \le \tau^2$ for
every $\tau \in [0, \varepsilon]$. Therefore, for every pair of locations $l, l' \in L$ and every $\tau \in [0, \varepsilon]$
we have:

$$\|(p_1(l', t-\tau) - p_1(l, t-\tau)) - (f_{p_2(t)}(l', t-\tau) - f_{p_2(t)}(l, t-\tau))\| \le 2 \cdot \tau^2.$$

Since we are dealing with normed Markov games, we have $\sum_{l' \in L} \mathbf{R}(l, a, l') = 1$ for
every location $l \in L$ and every action $a \in \Sigma(l)$. Therefore, we also have for every
action $a$:

$$\Bigl\| \sum_{l' \in L} \mathbf{R}(l, a, l') \cdot (p_1(l', t-\tau) - p_1(l, t-\tau)) - \sum_{l' \in L} \mathbf{R}(l, a, l') \cdot (f_{p_2(t)}(l', t-\tau) - f_{p_2(t)}(l, t-\tau)) \Bigr\| \le 2 \cdot \tau^2.$$

This implies that $\|\dot{p}_2(l, t-\tau) - \dot{f}_{p_2(t)}(l, t-\tau)\| \le 2 \cdot \tau^2$.

We can obtain the claimed result by integrating over this difference:

$$\|p_2(l, t-\tau) - f_{p_2(t)}(l, t-\tau)\| \le \int_0^{\tau} \|\dot{p}_2(l, t-\tau') - \dot{f}_{p_2(t)}(l, t-\tau')\| \, \mathrm{d}\tau' \le \frac{2}{3} \tau^3.$$

Therefore, the total amount of error incurred by $p_2$ in the interval $[t-\varepsilon, t]$ is at most
$\frac{2}{3} \varepsilon^3$. □



G Proof of Lemma 6

To begin, we prove an auxiliary lemma that will be used throughout the rest of the
proof.

Lemma 17. Let $f$ and $g$ be two functions such that $\|f(t-\tau) - g(t-\tau)\| \le c \cdot \tau^k$. If
$a_f$ is an action that maximises (resp. minimises)

$$\operatorname{opt}_{a \in \Sigma(l)} \sum_{l' \in L} \mathbf{Q}(l, a, l') \cdot f(l', t-\tau), \tag{12}$$

and $a_g$ is an action that maximises (resp. minimises)

$$\operatorname{opt}_{a \in \Sigma(l)} \sum_{l' \in L} \mathbf{Q}(l, a, l') \cdot g(l', t-\tau), \tag{13}$$

then we have:

$$\Bigl\| \sum_{l' \in L} \mathbf{R}(l, a_g, l') \cdot g(l', t-\tau) - \sum_{l' \in L} \mathbf{R}(l, a_f, l') \cdot f(l', t-\tau) \Bigr\| \le 3 \cdot c \cdot \tau^k.$$

Proof: We will provide a proof for the case where the equations must be maximised;
the proof for the minimisation case is identical. We begin by noting that the property
$\|f(t-\tau) - g(t-\tau)\| \le c \cdot \tau^k$, and the fact that we consider only normed Markov
games imply that, for every action $a$ we have:

$$\Bigl\| \sum_{l' \in L} \mathbf{Q}(l, a, l') \cdot f(l', t-\tau) - \sum_{l' \in L} \mathbf{Q}(l, a, l') \cdot g(l', t-\tau) \Bigr\| \le 2 \cdot c \cdot \tau^k. \tag{14}$$

We use this to claim that the following inequality holds:

$$\Bigl\| \sum_{l' \in L} \mathbf{Q}(l, a_g, l') \cdot g(l', t-\tau) - \sum_{l' \in L} \mathbf{Q}(l, a_f, l') \cdot f(l', t-\tau) \Bigr\| \le 2 \cdot c \cdot \tau^k. \tag{15}$$

To see why, suppose that

$$\sum_{l' \in L} \mathbf{Q}(l, a_g, l') \cdot g(l', t-\tau) > \sum_{l' \in L} \mathbf{Q}(l, a_f, l') \cdot f(l', t-\tau) + 2 \cdot c \cdot \tau^k.$$

Then we could invoke Equation (14) to argue that $\sum_{l' \in L} \mathbf{Q}(l, a_g, l') \cdot f(l', t-\tau) >
\sum_{l' \in L} \mathbf{Q}(l, a_f, l') \cdot f(l', t-\tau)$, which contradicts the fact that $a_f$ achieves the maximum
in Equation (12). Similarly, if $\sum_{l' \in L} \mathbf{Q}(l, a_f, l') \cdot f(l', t-\tau) > \sum_{l' \in L} \mathbf{Q}(l, a_g, l') \cdot
g(l', t-\tau) + 2 \cdot c \cdot \tau^k$, then we can invoke Equation (14) to argue that $a_g$ does not
achieve the maximum in Equation (13). Therefore, Equation (15) must hold.

Now, to finish the proof, we apply the fact that $\|f(t-\tau) - g(t-\tau)\| \le c \cdot \tau^k$ to
Equation (15) to obtain:

$$\Bigl\| \sum_{l' \in L} \mathbf{R}(l, a_g, l') \cdot g(l', t-\tau) - \sum_{l' \in L} \mathbf{R}(l, a_f, l') \cdot f(l', t-\tau) \Bigr\| \le 3 \cdot c \cdot \tau^k.$$

This completes the proof. □
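
The bound of Lemma 17 can be checked numerically on random instances. The sketch below is an illustration only, covering the maximising case: the data layout, the helper names, and the random instances are hypothetical, and it assumes that in a normed game $\mathbf{Q}(l, a, l) = \mathbf{R}(l, a, l) - 1$, so that $\sum_{l'} \mathbf{Q}(l, a, l') h(l') = \sum_{l'} \mathbf{R}(l, a, l') h(l') - h(l)$.

```python
import random

def q_value(row, l, h):
    # sum_{l'} Q(l, a, l') * h(l'), assuming Q(l, a, l) = R(l, a, l) - 1 (normed game)
    return sum(r * h[t] for t, r in row.items()) - h[l]

def r_value(row, h):
    # sum_{l'} R(l, a, l') * h(l')
    return sum(r * h[t] for t, r in row.items())

def lemma17_gap(actions, l, f, g):
    """Left-hand side of the bound in Lemma 17 at location l (maximising case)."""
    a_f = max(actions, key=lambda a: q_value(actions[a], l, f))
    a_g = max(actions, key=lambda a: q_value(actions[a], l, g))
    return abs(r_value(actions[a_g], g) - r_value(actions[a_f], f))

def random_row(rng, n):
    """A random rate row over n target locations whose rates sum to 1."""
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return {t: w / total for t, w in enumerate(weights)}

rng = random.Random(0)
n, delta = 5, 0.01  # delta plays the role of c * tau**k
worst = 0.0
for _ in range(1000):
    actions = {a: random_row(rng, n) for a in range(4)}
    f = [rng.random() for _ in range(n)]
    g = [x + rng.uniform(-delta, delta) for x in f]  # sup-norm distance at most delta
    worst = max(worst, lemma17_gap(actions, 0, f, g))
```

On such instances the largest observed gap stays below $3 \cdot \delta$. This is a sanity check of the statement, not a substitute for the proof.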
To prove Lemma 6 we will consider the following class of strategies: play the action
chosen by $p_2$ for the first $k$ transitions, and then play the action chosen by $p_1$ for the
remainder of the interval. We will denote the reachability probability obtained by this
strategy as $g_2^k$, and we will denote the error of this strategy as $\mathcal{E}_s^k(2, \varepsilon) := \|g_2^k(t-\varepsilon) -
f_{p_2(t)}(t-\varepsilon)\|$. Clearly, as $k$ approaches infinity, we have that $g_2^k$ approaches $g_2$, and
$\mathcal{E}_s^k(2, \varepsilon)$ approaches $\mathcal{E}_s(2, \varepsilon)$. Therefore, in order to prove Lemma 6, we will show that
$\mathcal{E}_s^k(2, \varepsilon) \le 2 \cdot \varepsilon^3$ for all $k$.

We will prove error bounds on $g_2^k$ by induction. The following lemma considers the
base case, where $k = 1$. In other words, it considers the strategy that plays the action
chosen by $p_2$ for the first transition, and then plays the action chosen by $p_1$ for the rest
of the interval.

Lemma 18. If $\varepsilon \le 1$, then we have $\mathcal{E}_s^1(2, \varepsilon) \le 2 \cdot \varepsilon^3$.

Proof: Suppose that the first discrete transition occurs at time $t-\tau$, where $\tau \in [0, \varepsilon]$.
Let $l$ be a location belonging to the reachability player, and let $a_l^{t-\tau}$ be the action that
maximises $p_2$ at time $t-\tau$. By definition, we know that the probability of moving to
a location $l'$ is given by $\mathbf{R}(l, a_l^{t-\tau}, l')$, and we know that the time-bounded reachability
probabilities for each state $l'$ are given by $g_1(l', t-\tau)$. Therefore, the outcome of
choosing $a_l^{t-\tau}$ at time $t-\tau$ is $\sum_{l' \in L} \mathbf{R}(l, a_l^{t-\tau}, l') g_1(l', t-\tau)$. If $a^*$ is an action that
would be chosen by $f_{p_2(t)}$ at time $t-\tau$, then we have the following bounds:

$$\begin{aligned}
&\Bigl\| \sum_{l' \in L} \mathbf{R}(l, a_l^{t-\tau}, l') g_1(l', t-\tau) - \sum_{l' \in L} \mathbf{R}(l, a^*, l') f_{p_2(t)}(l', t-\tau) \Bigr\| \\
&\quad \le \Bigl\| \sum_{l' \in L} \mathbf{R}(l, a_l^{t-\tau}, l') p_1(l', t-\tau) - \sum_{l' \in L} \mathbf{R}(l, a^*, l') f_{p_2(t)}(l', t-\tau) \Bigr\| + \tau^2 \\
&\quad \le 4 \cdot \tau^2.
\end{aligned}$$

The first inequality follows from Lemma 16 and the second inequality follows from
Lemma 17.

Now suppose that $l$ is a location belonging to the safety player. Since the reachability
player will follow $p_1$ during the interval $[t-\tau, t]$, we know that the safety player will
choose an action $a_g$ that minimises:

$$\min_{a \in \Sigma(l)} \sum_{l' \in L} \mathbf{Q}(l, a, l') \cdot g_1(l', t-\tau).$$

If $a^*$ is the action chosen by $f$ at time $t-\tau$, then Lemma 4 and Lemma 17 imply:

$$\Bigl\| \sum_{l' \in L} \mathbf{R}(l, a_g, l') g_1(l', t-\tau) - \sum_{l' \in L} \mathbf{R}(l, a^*, l') f_{p_2(t)}(l', t-\tau) \Bigr\| \le 6 \cdot \tau^2.$$

So far we have proved that the total amount of error made by $g_2^1$ when the first
transition occurs at time $t-\tau$ is at most $6 \cdot \tau^2$. To obtain error bounds for $g_2^1$ over
the entire interval $[t-\varepsilon, t]$, we consider the probability that the first transition actually
occurs at time $t-\tau$:

$$\mathcal{E}_s^1(2, \varepsilon) \le \int_0^{\varepsilon} e^{\tau - \varepsilon} \, 6 \tau^2 \, \mathrm{d}\tau \le \int_0^{\varepsilon} 6 \tau^2 \, \mathrm{d}\tau = 2 \cdot \varepsilon^3.$$

This completes the proof. □
We now prove the inductive step, by considering $g_2^{k+1}$. This is the strategy that follows
the action chosen by $p_2$ for the first $k+1$ transitions, and then follows $p_1$ for the rest of the
interval.

Lemma 19. If $\mathcal{E}_s^k(2, \varepsilon) \le 2 \cdot \varepsilon^3$ for some $k$, then $\mathcal{E}_s^{k+1}(2, \varepsilon) \le 2 \cdot \varepsilon^3$.

Proof: The structure of this proof is similar to the proof of Lemma 18; however, we
must account for the fact that $g_2^{k+1}$ follows $g_2^k$ after the first transition rather than $g_1$.
Suppose that we play the strategy for $g_2^{k+1}$, and that the first discrete transition
occurs at time $t-\tau$, where $\tau \in [0, \varepsilon]$. Let $l$ be a location belonging to the reachability
player, and let $a_l^{t-\tau}$ be the action that maximises $p_2$ at time $t-\tau$. If $a^*$ is an action that
would be chosen by $f_{p_2(t)}$ at time $t-\tau$, then we have the following bounds:

$$\begin{aligned}
&\Bigl\| \sum_{l' \in L} \mathbf{R}(l, a_l^{t-\tau}, l') g_2^k(l', t-\tau) - \sum_{l' \in L} \mathbf{R}(l, a^*, l') f_{p_2(t)}(l', t-\tau) \Bigr\| \\
&\quad \le \Bigl\| \sum_{l' \in L} \mathbf{R}(l, a_l^{t-\tau}, l') p_1(l', t-\tau) - \sum_{l' \in L} \mathbf{R}(l, a^*, l') f_{p_2(t)}(l', t-\tau) \Bigr\| + \tau^2 + 2 \cdot \tau^3 \\
&\quad \le 4 \cdot \tau^2 + 2 \cdot \tau^3 \le 6 \cdot \tau^2.
\end{aligned}$$

The first inequality follows from the inductive hypothesis, which gives bounds on how
far $g_2^k$ is from $f_{p_2(t)}$, and from Lemma 3, which gives bounds on how far $f_{p_2(t)}$ is
from $p_1$. The second inequality follows from Lemma 3 and Lemma 17, and the final
inequality follows from the fact that $\tau \le 1$.

Now suppose that the location $l$ belongs to the safety player. Let $a_g$ be an action
that minimises:

$$\min_{a \in \Sigma(l)} \sum_{l' \in L} \mathbf{Q}(l, a, l') \cdot g_2^k(l', t-\tau).$$

If $a^*$ is the action chosen by $f$ at time $t-\tau$, then Lemma 4 and Lemma 17 imply:

$$\Bigl\| \sum_{l' \in L} \mathbf{R}(l, a_g, l') g_2^k(l', t-\tau) - \sum_{l' \in L} \mathbf{R}(l, a^*, l') f_{p_2(t)}(l', t-\tau) \Bigr\| \le 6 \cdot \tau^3 \le 6 \cdot \tau^2.$$

The first inequality follows from the inductive hypothesis and Lemma 17, and the
second inequality follows from the fact that $\tau \le 1$.

To obtain error bounds for $g_2^{k+1}$ over the entire interval $[t-\varepsilon, t]$, we consider the
probability that the first transition actually occurs at time $t-\tau$:

$$\mathcal{E}_s^{k+1}(2, \varepsilon) \le \int_0^{\varepsilon} e^{\tau - \varepsilon} \, 6 \tau^2 \, \mathrm{d}\tau \le \int_0^{\varepsilon} 6 \tau^2 \, \mathrm{d}\tau = 2 \cdot \varepsilon^3.$$

This completes the proof. □

H Proof of Lemma 7

We give the algorithm for the reachability player. The algorithm for the safety player is
symmetric. For every location $l \in L$, and time point $\tau \in [0, \varepsilon]$, we define the quality of
an action $a$ as:

$$q_l^{\tau}(a) := \sum_{l' \in L} \mathbf{R}(l, a, l') \cdot (p_1(l', t-\tau) - p_1(l, t-\tau)).$$

We also define an operator that compares the quality of two actions. For two actions $a_1$
and $a_2$, we have $a_1 \preceq_l^{\tau} a_2$ if and only if $q_l^{\tau}(a_1) \le q_l^{\tau}(a_2)$, and we have $a_1 \prec_l^{\tau} a_2$ if and
only if $q_l^{\tau}(a_1) < q_l^{\tau}(a_2)$.

Algorithm 1 shows the key component of our algorithm for computing the
approximation $p_2$ during the interval $[t-\varepsilon, t]$. The algorithm outputs a list $O$
containing pairs $(a, \tau)$, where $a$ is an action and $\tau$ is a point in time, which represents
the optimal actions during the interval $[t-\varepsilon, t]$: if the algorithm outputs the list
$O = \langle (a_1, \tau_1), (a_2, \tau_2), \ldots, (a_n, \tau_n) \rangle$, then $a_1$ maximises Equation (7) for the
interval $[t-\tau_2, t-\tau_1]$, $a_2$ maximises Equation (7) for the interval $[t-\tau_3, t-\tau_2]$, and so
on.

The algorithm computes $O$ as follows. It begins by sorting the actions according
to their quality at time $t$. Since $a_1$ maximises the quality at time $t$, we know that $a_1$ is
chosen by Equation (7) at time $t$. Therefore, the algorithm initialises $O$ by assuming
that $a_1$ maximises Equation (7) for the entire interval $[t-\varepsilon, t]$. The algorithm then
proceeds by iterating through the actions $\langle a_2, a_3, \ldots, a_m \rangle$.

We will prove the following invariant on the outer loop of the algorithm: if the first $i$
actions have been processed, then the list $O$ gives the solution to:

$$-\dot{p}_2(l, \tau, i) = \max_{a \in \langle a_1, a_2, \ldots, a_i \rangle} \sum_{l' \in L} \mathbf{R}(l, a, l') \cdot (p_1(l', \tau) - p_1(l, \tau)). \tag{16}$$



Algorithm 1 BestActions

  Sort the actions into a list $\langle a_1, a_2, \ldots, a_m \rangle$ such that $a_i \succeq_l^{0} a_{i+1}$ for all $i$.
  $O := \langle (a_1, 0) \rangle$.
  for $i := 2$ to $m$ do
    $(a, \tau)$ := the last element in $O$.
    if $a \prec_l^{\varepsilon} a_i$ then
      while true do
        $x$ := the point at which $q_l^{x}(a) = q_l^{x}(a_i)$.
        if $x > \tau$ then
          Add $(a_i, x)$ to the end of $O$.
          break
        else
          Remove $(a, \tau)$ from $O$.
          $(a, \tau)$ := the last element in $O$.
        end if
      end while
    end if
  end for
  return $O$.



In other words, the list $O$ would be a solution to Equation (7) if the actions
$\langle a_{i+1}, a_{i+2}, \ldots, a_m \rangle$ did not exist. Clearly, when $i = m$ the list $O$ will actually be a
solution to Equation (7).

We will prove this invariant by induction. The base case is trivially true, because
when $i = 1$ the maximum in Equation (16) considers only $a_1$, and therefore $a_1$ is
optimal throughout the interval $[t-\varepsilon, t]$. We now prove the inductive step. Assume
that $O$ is a solution to Equation (16) for $i-1$. We must show that Algorithm 1 correctly
computes $O$ for $i$. Let us consider the operations that Algorithm 1 performs on the
action $a_i$. It compares $a_i$ with the pair $(a, \tau)$, which is the final pair in $O$, and one of
three actions is performed:

• If $a_i \preceq_l^{\varepsilon} a$, then the algorithm ignores $a_i$. This is because we have $a_i \preceq_l^{0} a_1$, which
means that $a_i$ is worse than $a_1$ at time $t$, and we have $a_i \preceq_l^{\varepsilon} a$, which implies that
$a_i$ is worse than $a$ at time $t-\varepsilon$. Since $q_l^{\tau}(a_i)$ is a linear function, we can conclude
that $a_i$ never maximises Equation (7) during the interval $[t-\varepsilon, t]$.

• If $x$, which is the point at which the functions $q_l^{x}(a)$ and $q_l^{x}(a_i)$ intersect, is greater
than $\tau$, then we add $(a_i, x)$ to $O$. This is because the fact that $q_l^{\tau}(a_i)$ and $q_l^{\tau}(a)$ are
linear functions implies that $a_i$ cannot be optimal for any time $\tau' < \tau$.

• Finally, if $x$ is smaller than $\tau$, then we remove $(a, \tau)$ from $O$ and continue by
comparing $a_i$ to the new final pair in $O$. From the inductive hypothesis, we have
that $a$ is not optimal for any time point $\tau' < \tau$, and the fact that $x < \tau$ and the
fact that $q_l^{\tau}(a_i)$ and $q_l^{\tau}(a)$ are linear functions implies that $a_i$ is better than $a$ for
every time point $\tau' \ge \tau$. Therefore, $a$ can never be optimal.

These three observations are sufficient to prove that Algorithm 1 correctly computes $O$,
and $O$ can obviously be used to compute the approximation $p_2$. The following lemma
gives the time complexity of the algorithm.



Lemma 20. Algorithm 1 runs in time $O(|\Sigma| \log |\Sigma|)$.

Proof: Since sorting can be done in $O(|\Sigma| \log |\Sigma|)$ time, the first step of this
algorithm also takes $O(|\Sigma| \log |\Sigma|)$ time. We claim that the remaining steps of the algorithm
take $O(|\Sigma|)$ time. To see this, note that after computing a crossing point $x$, the
algorithm either adds an action to the list $O$, or removes an action from $O$. Moreover each
action $a$ can enter the list $O$ at most once, and leave the list $O$ at most once. Therefore
at most $2 \cdot |\Sigma|$ crossing points are computed in total. □
We now complete the proof of Lemma 7. In order to compute the approximation $p_2$,
we simply run Algorithm 1 for each location $l \in L$, which takes $O(|L| \cdot |\Sigma| \log |\Sigma|)$
time. Finally, we must account for the time taken to compute the approximation $p_1$,
which takes $O(|\mathcal{M}|)$ time, as argued in Theorem 1. Therefore, we can compute $p_2$ in
time $O(|\mathcal{M}| + |L| \cdot |\Sigma| \log |\Sigma|)$.
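
The procedure described in this appendix can be sketched as follows for a single location. This is an illustrative reading, not the authors' implementation: each action's quality is assumed to be given as a line `(value at offset 0, slope)` over the offset $\tau \in [0, \varepsilon]$, the names `best_actions` and `lines` are hypothetical, and the sketch uses `x >= tau` in the inner test so that the first entry of the list is never removed when two lines cross exactly at offset 0.

```python
from fractions import Fraction

def best_actions(lines, eps):
    """Return [(action, start_offset), ...] describing the upper envelope on [0, eps].

    lines maps each action to (quality at offset 0, slope of the quality).
    """
    def q(a, x):
        b, m = lines[a]
        return b + m * x

    # Sort by decreasing quality at offset 0, i.e. at time t.
    order = sorted(lines, key=lambda a: q(a, 0), reverse=True)
    out = [(order[0], 0)]
    for a_i in order[1:]:
        a, tau = out[-1]
        if q(a, eps) < q(a_i, eps):          # a_i is strictly better at time t - eps
            while True:
                (b1, m1), (b2, m2) = lines[a], lines[a_i]
                x = (b1 - b2) / (m2 - m1)    # offset at which the two lines cross
                if x >= tau:
                    out.append((a_i, x))     # a_i takes over from offset x
                    break
                out.pop()                    # a is never optimal, so discard it
                a, tau = out[-1]
    return out

# Hypothetical qualities: action "b" is overtaken by "c" before it becomes optimal.
lines = {
    "a": (Fraction(1), Fraction(0)),
    "b": (Fraction(1, 2), Fraction(1)),
    "c": (Fraction(2, 5), Fraction(3)),
}
result = best_actions(lines, Fraction(1))
```

In this example `"b"` is first appended with start offset $1/2$ and then removed again when `"c"` is processed, which illustrates why each action enters and leaves the list at most once.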



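I Proof of Theorem 2

Proof: Lemma 5 gives the step error for double $\varepsilon$-nets to be $\frac{2}{3} \varepsilon^3$. Since there are $\frac{T}{\varepsilon}$
intervals, Lemma 2 implies that the global error of double $\varepsilon$-nets is $\frac{2}{3} \varepsilon^3 \cdot \frac{T}{\varepsilon} = \frac{2}{3} \varepsilon^2 \cdot T$.
In order to achieve a precision of $\pi$, we must select an $\varepsilon$ that satisfies $\frac{2}{3} \varepsilon^2 \cdot T = \pi$.
Therefore, we choose $\varepsilon = \sqrt{\frac{3\pi}{2T}}$, which gives $T \cdot \sqrt{\frac{2T}{3\pi}}$ intervals.

The cost of computing each interval is given by Lemma 7 as $O(|\mathcal{M}| + |L| \cdot |\Sigma| \cdot
\log |\Sigma|)$, and there are $T \cdot \sqrt{\frac{2T}{3\pi}}$ intervals overall, which gives the claimed complexity
of $O(|\mathcal{M}| \cdot T \cdot \sqrt{\frac{T}{\pi}} + |L| \cdot T \cdot \sqrt{\frac{T}{\pi}} \cdot |\Sigma| \log |\Sigma|)$. □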



J Proof of Lemma 8

Our arguments here are generalisations of those given for the claims made in
Section 3.3.



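J.1 Error bounds for the approximation $p_k$

The following lemma is a generalisation of Lemma 5.

Lemma 21. For every $k > 1$, if we have $\mathcal{E}(k, \varepsilon) \le c \cdot \varepsilon^{k+1}$, then we have
$\mathcal{E}(k+1, \varepsilon) \le \frac{2}{k+2} \cdot c \cdot \varepsilon^{k+2}$.

Proof: The inductive hypothesis implies that $\|p_k(t-\tau) - f_{p_{k+1}(t)}(t-\tau)\| \le c \cdot \tau^{k+1}$ for
every $\tau \in [0, \varepsilon]$. Therefore, for every pair of locations $l, l' \in L$ and every $\tau \in [0, \varepsilon]$
we have:

$$\|(p_k(l', t-\tau) - p_k(l, t-\tau)) - (f_{p_{k+1}(t)}(l', t-\tau) - f_{p_{k+1}(t)}(l, t-\tau))\| \le 2 \cdot c \cdot \tau^{k+1}.$$

Since we are dealing with normed Markov games, we have $\sum_{l' \in L} \mathbf{R}(l, a, l') = 1$ for
every location $l \in L$ and every action $a \in \Sigma(l)$. Therefore, we also have for every
action $a$:

$$\Bigl\| \sum_{l' \in L} \mathbf{R}(l, a, l') \cdot (p_k(l', t-\tau) - p_k(l, t-\tau)) - \sum_{l' \in L} \mathbf{R}(l, a, l') \cdot (f_{p_{k+1}(t)}(l', t-\tau) - f_{p_{k+1}(t)}(l, t-\tau)) \Bigr\| \le 2 \cdot c \cdot \tau^{k+1}.$$

This implies that $\|\dot{p}_{k+1}(l, t-\tau) - \dot{f}_{p_{k+1}(t)}(l, t-\tau)\| \le 2 \cdot c \cdot \tau^{k+1}$.

We can obtain the claimed result by integrating over this difference:

$$\mathcal{E}(k+1, \tau) \le \int_0^{\tau} \|\dot{p}_{k+1}(l, t-\tau') - \dot{f}_{p_{k+1}(t)}(l, t-\tau')\| \, \mathrm{d}\tau' \le \frac{2}{k+2} \cdot c \cdot \tau^{k+2}.$$

Therefore, the total amount of error incurred by $p_{k+1}$ in $[t-\varepsilon, t]$ is at most $\frac{2}{k+2} \cdot c \cdot \varepsilon^{k+2}$.
□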



J.2 Error bounds for the approximation $g_k$

We will prove the claim for the reachability player, because the proof for the safety
player is entirely symmetric. We begin by defining the approximation $g_k$, which gives
the time-bounded reachability probability when the reachability player follows the
actions chosen by $p_k$. If $a_l^{\tau}$ is the action that maximises Equation (10) at the location $l$ for
the time point $\tau \in [t-\varepsilon, t]$ then we define $g_k(l, \tau)$ as:

$$-\dot{g}_k(l, \tau) = \sum_{l' \in L} \mathbf{Q}(l, a_l^{\tau}, l') \cdot g_k(l', \tau) \quad \text{if } l \in L_r, \tag{17}$$

$$-\dot{g}_k(l, \tau) = \min_{a \in \Sigma(l)} \sum_{l' \in L} \mathbf{Q}(l, a, l') \cdot g_k(l', \tau) \quad \text{if } l \in L_s. \tag{18}$$

Our approach to proving error bounds for $g_k$ follows the approach that we used
in the proof of Lemma 6. We will consider the following class of strategies: play the
action chosen by $p_k$ for the first $i$ transitions, and then play the action chosen by $p_{k-1}$ for
the remainder of the interval. We will denote the reachability probability obtained by
this strategy as $g_k^i$, and we will denote the error of this strategy as $\mathcal{E}_s^i(k, \varepsilon) := \|g_k^i(t-\varepsilon) -
f_{p_k(t)}(t-\varepsilon)\|$. Clearly, as $i$ approaches infinity, we have that $g_k^i$ approaches $g_k$,
and $\mathcal{E}_s^i(k, \varepsilon)$ approaches $\mathcal{E}_s(k, \varepsilon)$. Therefore, if a bound can be established on $\mathcal{E}_s^i(k, \varepsilon)$
for all $i$, then that bound also holds for $\mathcal{E}_s(k, \varepsilon)$.

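We have by assumption that $\mathcal{E}(k, \varepsilon) \le c \cdot \varepsilon^{k+1}$ and $\mathcal{E}_s(k, \varepsilon) \le d \cdot \varepsilon^{k+1}$, and our goal
is to prove that $\mathcal{E}_s(k+1, \varepsilon) \le \frac{8c+3d}{k+2} \cdot \varepsilon^{k+2}$. We will prove error bounds on $g_{k+1}^i$ by
induction. The following lemma considers the base case, where $i = 1$. In other words,
it considers the strategy that plays the action chosen by $p_{k+1}$ for the first transition, and
then plays the action chosen by $p_k$ for the rest of the interval.

Lemma 22. If $\varepsilon \le 1$, $\mathcal{E}(k, \varepsilon) \le c \cdot \varepsilon^{k+1}$, and $\mathcal{E}_s(k, \varepsilon) \le d \cdot \varepsilon^{k+1}$, then we have
$\mathcal{E}_s^1(k+1, \varepsilon) \le \frac{4c+3d}{k+2} \cdot \varepsilon^{k+2}$.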



Proof: Suppose that the first discrete transition occurs at time $t-\tau$, where $\tau \in [0, \varepsilon]$.
Let $l$ be a location belonging to the reachability player, and let $a_l^{t-\tau}$ be the action that
maximises $p_{k+1}$ at time $t-\tau$. By definition, we know that the probability of moving to
a location $l'$ is given by $\mathbf{R}(l, a_l^{t-\tau}, l')$, and we know that the time-bounded reachability
probabilities for each state $l'$ are given by $g_k(l', t-\tau)$. Therefore, the outcome of
choosing $a_l^{t-\tau}$ at time $t-\tau$ is $\sum_{l' \in L} \mathbf{R}(l, a_l^{t-\tau}, l') g_k(l', t-\tau)$. If $a^*$ is an action that
would be chosen by $f_{p_{k+1}(t)}$ at time $t-\tau$, then we have the following bounds:

$$\begin{aligned}
&\Bigl\| \sum_{l' \in L} \mathbf{R}(l, a_l^{t-\tau}, l') g_k(l', t-\tau) - \sum_{l' \in L} \mathbf{R}(l, a^*, l') f_{p_{k+1}(t)}(l', t-\tau) \Bigr\| \\
&\quad \le \Bigl\| \sum_{l' \in L} \mathbf{R}(l, a_l^{t-\tau}, l') p_k(l', t-\tau) - \sum_{l' \in L} \mathbf{R}(l, a^*, l') f_{p_{k+1}(t)}(l', t-\tau) \Bigr\| + c \cdot \tau^{k+1} + d \cdot \tau^{k+1} \\
&\quad \le 4 \cdot c \cdot \tau^{k+1} + d \cdot \tau^{k+1}.
\end{aligned}$$

The first inequality follows from the bounds given for $\mathcal{E}(k, \varepsilon)$ and $\mathcal{E}_s(k, \varepsilon)$. The second
inequality follows from the bounds given for $\mathcal{E}(k, \varepsilon)$ and Lemma 17.

Now suppose that $l$ is a location belonging to the safety player. Since the reachability
player will follow $p_k$ during the interval $[t-\tau, t]$, we know that the safety player will
choose an action $a_g$ that minimises:

$$\min_{a \in \Sigma(l)} \sum_{l' \in L} \mathbf{Q}(l, a, l') \cdot g_k(l', t-\tau).$$

If $a^*$ is the action chosen by $f$ at time $t-\tau$, then the following inequality is a
consequence of Lemma 17:

$$\Bigl\| \sum_{l' \in L} \mathbf{R}(l, a_g, l') g_k(l', t-\tau) - \sum_{l' \in L} \mathbf{R}(l, a^*, l') f_{p_{k+1}(t)}(l', t-\tau) \Bigr\| \le 3 \cdot d \cdot \tau^{k+1}.$$

So far we have proved that the total amount of error made by $g_{k+1}^1$ when the first
transition occurs at time $t-\tau$ is at most $(4c + 3d) \cdot \tau^{k+1}$. To obtain error bounds for
$g_{k+1}^1$ over the entire interval $[t-\varepsilon, t]$, we consider the probability that the first transition
actually occurs at time $t-\tau$:

$$\mathcal{E}_s^1(k+1, \varepsilon) \le \int_0^{\varepsilon} e^{\tau - \varepsilon} (4c + 3d) \cdot \tau^{k+1} \, \mathrm{d}\tau \le \int_0^{\varepsilon} (4c + 3d) \cdot \tau^{k+1} \, \mathrm{d}\tau = \frac{4c+3d}{k+2} \cdot \varepsilon^{k+2}.$$

This completes the proof. □

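Lemma 23. If $\mathcal{E}_s^i(k+1, \varepsilon) \le \frac{8c+3d}{k+2} \cdot \varepsilon^{k+2}$ for some $i$ and $\mathcal{E}(k, \varepsilon) \le c \cdot \varepsilon^{k+1}$, then
$\mathcal{E}_s^{i+1}(k+1, \varepsilon) \le \frac{8c+3d}{k+2} \cdot \varepsilon^{k+2}$.

Proof: The structure of this proof is similar to the proof of Lemma 22; however, we
must account for the fact that $g_{k+1}^{i+1}$ follows $g_{k+1}^i$ after the first transition rather than $g_k$.

Suppose that we play the strategy for $g_{k+1}^{i+1}$, and that the first discrete transition
occurs at time $t-\tau$, where $\tau \in [0, \varepsilon]$. Let $l$ be a location belonging to the reachability
player, and let $a_l^{t-\tau}$ be the action that maximises $p_{k+1}$ at time $t-\tau$. If $a^*$ is an action that
would be chosen by $f_{p_{k+1}(t)}$ at time $t-\tau$, then we have the following bounds:

$$\begin{aligned}
&\Bigl\| \sum_{l' \in L} \mathbf{R}(l, a_l^{t-\tau}, l') g_{k+1}^i(l', t-\tau) - \sum_{l' \in L} \mathbf{R}(l, a^*, l') f_{p_{k+1}(t)}(l', t-\tau) \Bigr\| \\
&\quad \le \Bigl\| \sum_{l' \in L} \mathbf{R}(l, a_l^{t-\tau}, l') p_k(l', t-\tau) - \sum_{l' \in L} \mathbf{R}(l, a^*, l') f_{p_{k+1}(t)}(l', t-\tau) \Bigr\| + c \cdot \tau^{k+1} + \frac{8c+3d}{k+2} \cdot \tau^{k+2} \\
&\quad \le 4c \cdot \tau^{k+1} + \frac{8c+3d}{k+2} \cdot \tau^{k+2} \\
&\quad \le (8c + 3d) \cdot \tau^{k+1}.
\end{aligned}$$

The first inequality follows from the inductive hypothesis, which gives bounds on how
far $g_{k+1}^i$ is from $f_{p_{k+1}(t)}$, and from the assumption about $\mathcal{E}(k, \varepsilon)$. The second inequality
follows from our assumption on $\mathcal{E}(k, \varepsilon)$ and Lemma 17, and the final inequality follows
from the fact that $\tau \le 1$ and $k \ge 2$.

Now suppose that the location $l$ belongs to the safety player. Let $a_g$ be an action
that minimises:

$$\min_{a \in \Sigma(l)} \sum_{l' \in L} \mathbf{Q}(l, a, l') \cdot g_{k+1}^i(l', t-\tau).$$

If $a^*$ is the action chosen by $f$ at time $t-\tau$, then our assumption about $\mathcal{E}_s^i(k+1, \varepsilon)$
and Lemma 17 imply:

$$\Bigl\| \sum_{l' \in L} \mathbf{R}(l, a_g, l') g_{k+1}^i(l', t-\tau) - \sum_{l' \in L} \mathbf{R}(l, a^*, l') f_{p_{k+1}(t)}(l', t-\tau) \Bigr\| \le 3 \cdot \frac{8c+3d}{k+2} \cdot \tau^{k+2} \le (8c + 3d) \cdot \tau^{k+1}.$$

The first inequality follows from the inductive hypothesis and Lemma 17, and the
second inequality follows from the fact that $\tau \le 1$ and $k \ge 2$.

To obtain error bounds for $g_{k+1}^{i+1}$ over the entire interval $[t-\varepsilon, t]$, we consider the
probability that the first transition actually occurs at time $t-\tau$:

$$\mathcal{E}_s^{i+1}(k+1, \varepsilon) \le \int_0^{\varepsilon} e^{\tau - \varepsilon} (8c + 3d) \cdot \tau^{k+1} \, \mathrm{d}\tau \le \int_0^{\varepsilon} (8c + 3d) \cdot \tau^{k+1} \, \mathrm{d}\tau = \frac{8c+3d}{k+2} \cdot \varepsilon^{k+2}.$$

This completes the proof. □
Our two lemmas together imply that $\mathcal{E}_s^i(k+1, \varepsilon) \le \frac{8c+3d}{k+2} \cdot \varepsilon^{k+2}$ for all $i$, and
hence we can conclude that $\mathcal{E}_s(k+1, \varepsilon) \le \frac{8c+3d}{k+2} \cdot \varepsilon^{k+2}$. This completes the proof of
Lemma 8.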



K Proof of Lemma 9

Proof: Lemma 8 implies that the step error of using a $k$-net is $\mathcal{E}(k, \varepsilon) \le c \cdot \varepsilon^{k+1}$ for
some small constant $c < 1$. Since we have $\frac{T}{\varepsilon}$ many intervals, Lemma 2 implies that the
global error is $T \cdot \varepsilon^k$. Therefore, to obtain a precision of $\pi$ we must choose $\varepsilon = \sqrt[k]{\pi / T}$.
□
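
To make the effect of these choices of $\varepsilon$ concrete, the following sketch computes the interval length and the number of intervals for single $\varepsilon$-nets (Theorem 1), double $\varepsilon$-nets (Theorem 2), and the general rule derived above. The function names and the sample values of $T$ and $\pi$ are illustrative assumptions; the counts are rounded up to whole intervals, and the general rule drops the constant factor, so for $k = 2$ it differs slightly from the double-net formula.

```python
import math

def single_net_intervals(T, precision):
    eps = precision / T                        # Theorem 1: eps * T = precision
    return eps, math.ceil(T / eps)

def double_net_intervals(T, precision):
    eps = math.sqrt(3 * precision / (2 * T))   # Theorem 2: (2/3) * eps**2 * T = precision
    return eps, math.ceil(T / eps)

def k_net_intervals(T, precision, k):
    eps = (precision / T) ** (1.0 / k)         # Lemma 9: T * eps**k = precision
    return eps, math.ceil(T / eps)

# Hypothetical parameters chosen only to show the trend.
T, precision = 4.0, 0.01
e1, n1 = single_net_intervals(T, precision)
e2, n2 = double_net_intervals(T, precision)
e3, n3 = k_net_intervals(T, precision, 3)
```

For these sample parameters the number of intervals shrinks from single to double to triple $\varepsilon$-nets, which is the saving that motivates the higher-order approximations.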



L Proof of Lemma [10] 



Proof: We will prove this claim by induction. For the base case, we have by definition
that $p_1$ is a linear function over the interval $[t - \varepsilon, t]$. For the inductive step, assume that
we have proved that $p_{k-1}$ is piecewise polynomial with degree at most $k - 1$. From this,
we have that $\sum_{l' \in L} Q(l, a, l') \cdot p_{k-1}(l', \cdot)$ is a piecewise polynomial function with degree
at most $k - 1$ for every action $a$, and therefore $\operatorname{opt}_{a \in \Sigma(l)} \sum_{l' \in L} Q(l, a, l') \cdot p_{k-1}(l', \cdot)$ is
also a piecewise polynomial function with degree at most $k - 1$. Since the derivative $\dot{p}_k$ is a piecewise
polynomial function of degree at most $k - 1$, we have that $p_k$ is a piecewise polynomial
of degree at most $k$. □
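The last step of the induction, that integrating a polynomial of degree at most $k - 1$ yields one of degree at most $k$, can be sketched on a single polynomial piece as follows. This is a minimal illustration using exact rational arithmetic; the helper names `degree`, `antiderivative` and `evaluate` are hypothetical, and the sketch does not model the opt operator or the piece boundaries.

```python
from fractions import Fraction
from typing import List

Poly = List[Fraction]  # coefficients, lowest degree first


def degree(p: Poly) -> int:
    """Degree of p, ignoring trailing zero coefficients (the zero polynomial gives 0)."""
    d = len(p) - 1
    while d > 0 and p[d] == 0:
        d -= 1
    return d


def antiderivative(p: Poly, constant: Fraction = Fraction(0)) -> Poly:
    """Coefficients of an antiderivative of p with the given constant term."""
    return [constant] + [c / (i + 1) for i, c in enumerate(p)]


def evaluate(p: Poly, x: Fraction) -> Fraction:
    """Evaluate p at x using Horner's scheme."""
    result = Fraction(0)
    for c in reversed(p):
        result = result * x + c
    return result


if __name__ == "__main__":
    q = [Fraction(1), Fraction(-2)]  # q(x) = 1 - 2x, degree 1
    p = antiderivative(q)            # p(x) = x - x^2, degree 2
    print(degree(q), degree(p), evaluate(p, Fraction(1, 2)))
```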



M Proof of Lemma 11

Proof: Let $[t - \tau_1, t - \tau_2]$ be the boundaries of a piece in $p_{k-1}$. Since there can be at
most $\frac{1}{2}|\Sigma(l)|^2$ actions in the CTMG, we have that the optimum computed by Equation (10)
must choose from at most $\frac{1}{2}|\Sigma(l)|^2$ distinct polynomials of degree $k - 1$. Since each
pair of polynomials can intersect at most $k$ times, we have that $p_k$ can have at most
$\frac{1}{2} \cdot k \cdot |\Sigma(l)|^2$ pieces for each location $l$ in the interval $[t - \tau_1, t - \tau_2]$. Since $p_{k-1}$ has $c$
pieces in the interval $[t - \varepsilon, t]$, and $|L|$ locations, we have that $p_k$ can have at most
$\frac{1}{2} \cdot c \cdot k \cdot |L| \cdot |\Sigma|^2$ pieces during this interval. □



N Proof of Theorem 3

Proof: We know that double $\varepsilon$-nets can produce at most $|\Sigma|$ pieces per interval, and
therefore Lemma 11 implies that triple $\varepsilon$-nets can produce at most $\frac{3}{2} \cdot |L| \cdot |\Sigma|^3$ pieces
per interval, and there are $T \cdot \sqrt[3]{T/\pi}$ many intervals. To compute each piece, we must sort
$O(|\Sigma|)$ crossing points, which takes time $O(|\Sigma| \log |\Sigma|)$. Therefore, the total amount
of time required to compute $p_3$ is $O(T \cdot \sqrt[3]{T/\pi} \cdot |L| \cdot |\Sigma|^4 \cdot \log |\Sigma|)$.

For quadruple $\varepsilon$-nets, Lemma 11 implies that there will be at most $6 \cdot |L|^2 \cdot |\Sigma|^5$
pieces per interval, and there are at most $T \cdot \sqrt[4]{T/\pi}$ many intervals. Therefore, we can repeat
our argument for triple $\varepsilon$-nets to obtain an algorithm that runs in time
$O(T \cdot \sqrt[4]{T/\pi} \cdot |L|^2 \cdot |\Sigma|^6 \cdot \log |\Sigma|)$. □



O Collocation Methods for CTMDPs 



In the numerical evaluations of CTMCs, numerical methods like collocation techniques
play an important role. We briefly discuss the limits of these methods when applied to
CTMDPs, and in particular we will focus on the Runge-Kutta method. On sufficiently
smooth functions, the Runge-Kutta methods obtain very high precision. For example,
the RK4 method obtains a step error of $O(\varepsilon^5)$ for each interval of length $\varepsilon$. However,
these results critically depend on the degree of smoothness of the functor describing
the dynamics. To obtain this precision, the functor needs to be four times continuously
differentiable [10, p. 157]. Unfortunately, the Bellman equations describing CTMDPs
do not have this property. In fact, the functor defined by the Bellman equations is not
even once continuously differentiable due to the inf and/or sup operators they contain.

In this appendix we demonstrate on a simple example that the reduced precision is 
not merely a problem in the proof, but that the precision deteriorates once an inf or sup 
operator is introduced. We then show that the effect observed in the simple example can 
also be observed in the Bellman equations on the example CTMDP from Figure 1.

Our exposition will use the notation given in
http://en.wikipedia.org/wiki/Runge-Kutta_methods (accessed
08/04/2011).

O.1 A Simplified Example

Maximisation (or minimisation) in the functor that describes the dynamics of the system
results in functors with limited smoothness, which breaks the proof of the precision of
Runge-Kutta methods (including collocation techniques). To demonstrate that this is
not merely a technicality in the proof of the quality of Runge-Kutta methods, we show on
a simple example how the step precision deteriorates.

Using the notation of http://en.wikipedia.org/wiki/Runge-Kutta_methods
(but dropping the dependency on $t$, that is, $y' = f(y)$), consider a function $y = (y_1, y_2)$
with dynamics (the functor $f$) defined by $y_1' = \max\{0, y_2\}$ and $y_2' = 1$. Note
that the functor $f$ is not partially differentiable at $(0, 0)$ in the second argument, see
Figure 5.







Fig. 5. The left graph shows the variation of the first projection of the functor $f$ (that
is, of $\max\{0, y_2\}$) in the second argument (that is, of $y_2$). The right graph shows the
respective partial derivative in direction $y_2$ for the values on this line. At the origin
$(0, 0)$ itself, $f$ is clearly not differentiable.



Let us study the effect this has on the Runge-Kutta method on an interval of size $h$,
using the start value $y_n = (0, -\frac{1}{2}h)$. Applying RK4, we get

• $k_1 = f((0, -\frac{1}{2}h)) = (0, 1)$,
• $k_2 = f((0, 0)) = (0, 1)$,
• $k_3 = f((0, 0)) = (0, 1)$,
• $k_4 = f((0, \frac{1}{2}h)) = (\frac{1}{2}h, 1)$, and
• $y_{n+1} = y_n + \frac{1}{6}h(k_1 + 2k_2 + 2k_3 + k_4) = (h^2/12, h/2)$.

The analytical evaluation, however, provides $(h^2/8, h/2)$, which differs from the
provided result by $\frac{1}{24}h^2$ in the first projection. Note that the expected difference in the
first projection is in the order of $h^2$ if we place the point where max is in balance (the
'swapping point' that is related to the point where optimal strategies change) uniformly
at random at some point in the interval.
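The single RK4 step above can be reproduced numerically. The following Python sketch is only an illustration of the toy example; the names `f`, `rk4_step` and `exact_after_step` are chosen here and are not part of the paper or of any library.

```python
from typing import Tuple

Vec = Tuple[float, float]


def f(y: Vec) -> Vec:
    """Dynamics of the toy example: y1' = max{0, y2}, y2' = 1."""
    return (max(0.0, y[1]), 1.0)


def shift(y: Vec, a: float, k: Vec) -> Vec:
    """Return y + a * k, componentwise."""
    return (y[0] + a * k[0], y[1] + a * k[1])


def rk4_step(y: Vec, h: float) -> Vec:
    """One classical RK4 step of size h for the autonomous system y' = f(y)."""
    k1 = f(y)
    k2 = f(shift(y, h / 2, k1))
    k3 = f(shift(y, h / 2, k2))
    k4 = f(shift(y, h, k3))
    return (y[0] + h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]),
            y[1] + h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]))


def exact_after_step(h: float) -> Vec:
    """Analytical solution after time h when starting from (0, -h/2)."""
    return (h * h / 8, h / 2)


if __name__ == "__main__":
    h = 0.1
    numeric = rk4_step((0.0, -h / 2), h)
    exact = exact_after_step(h)
    print("RK4:", numeric, "exact:", exact, "gap in y1:", exact[0] - numeric[0])
```

For any step size $h$, the gap in the first component should come out as $h^2/24$ up to rounding, which is a second-order rather than a fifth-order step error.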

Still, one could object that we had to vary both the left and the right border of the
interval. But note that, if we take the initial value $y(0) = (0, -1) = y_0$, seek $y(2)$, and
cut the interval into $2n + 1$ pieces of equal length $h = \frac{2}{2n+1}$, then this is the middle
interval. (This family contains interval lengths of arbitrarily small size.)
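To see how this affects the result at the end of the whole interval, one can integrate from $y(0) = (0, -1)$ to $t = 2$ with $2n + 1$ RK4 steps and compare against the exact value $y_1(2) = \frac{1}{2}$. The sketch below is a self-contained illustration under these assumptions; the function names are hypothetical.

```python
from typing import Tuple

Vec = Tuple[float, float]


def f(y: Vec) -> Vec:
    """Dynamics of the toy example: y1' = max{0, y2}, y2' = 1."""
    return (max(0.0, y[1]), 1.0)


def rk4_step(y: Vec, h: float) -> Vec:
    """One classical RK4 step of size h for y' = f(y)."""
    def shifted(a: float, k: Vec) -> Vec:
        return (y[0] + a * k[0], y[1] + a * k[1])

    k1 = f(y)
    k2 = f(shifted(h / 2, k1))
    k3 = f(shifted(h / 2, k2))
    k4 = f(shifted(h, k3))
    return (y[0] + h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]),
            y[1] + h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]))


def global_error(n: int) -> Tuple[float, float]:
    """Integrate over [0, 2] with 2n + 1 equal steps; return (h, error in y1 at t = 2)."""
    steps = 2 * n + 1
    h = 2.0 / steps
    y: Vec = (0.0, -1.0)
    for _ in range(steps):
        y = rk4_step(y, h)
    exact_y1 = 0.5  # integral of max{0, t - 1} over [0, 2]
    return h, exact_y1 - y[0]


if __name__ == "__main__":
    for n in (5, 10, 20, 40):
        h, err = global_error(n)
        print(f"n={n}: h={h:.5f}, error={err:.3e}, error/h^2={err / (h * h):.5f}")
```

In this toy system every step other than the middle one is integrated exactly, so the printed ratio of error to $h^2$ should stay close to $1/24$: refining the mesh reduces the error only quadratically.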

O.2 Connection to the Bellman Equations

The first step when applying this to the Bellman equations is to convince ourselves that
their functor $F$, with components $F_l = \operatorname{opt} \sum \cdots$, is indeed not differentiable. We use $g$
for the arguments of $F$ in order to distinguish them from the solution $f$, where $f(t)$ is the
time-bounded reachability probability at time $t$.

For this, we simply re-use the example from Figure 1. The particular functor $F$ is
not differentiable at the origin: varying $F_{l_R}$ in the direction $g_l$ provides the function
shown in Figure 7, showing that $F_{l_R}$ is not differentiable at the origin.

(Due to the direction of the evaluation, this is the 'rightmost' point where the optimal
strategy changes.)

Again, differentiating $F_{l_R}(f(t_1))$ in the direction $g_l$ provides a non-differentiable
function. (In fact, a function similar to the function shown in Figure 7, but with adjusted
x-axis.)

An analytical argument with exponential functions is more involved than with the toy
example from the previous subsection. However, when the mesh length (or: interval size)
goes towards 0, the ascent of the exponential functions is almost constant throughout the
mesh/interval. In the limit, the effect is the same and the error is in the order of $h^2$.







Fig. 6. $y_1$ and $y_2$ from the solution of the ODE of the simplified example in the time
interval $[0, 2]$.




Fig. 7. The left graph shows the variation of the first projection of the functor $F$ in
the argument $g_l$ at the origin. The right graph shows the respective partial derivative
in direction $g_l$ for the values on this line. At the origin itself, $F$ is clearly not
differentiable.



